Abstract

The mutual information between two jointly distributed random variables $X$ and $Y$ is a functional of the joint distribution
$P_{XY}$, which is sometimes difficult to handle or estimate. A coarser description of the statistical behavior of $(X, Y)$ is given by
the marginal distributions $P_X, P_Y$ and the adjacency relation induced by the joint distribution, where $x$ and $y$ are adjacent if
$P(x, y) > 0$. We derive a lower bound on the mutual information in terms of these entities. The bound is obtained by viewing
the channel from $X$ to $Y$ as a probability distribution on a set of possible actions, where an action determines the output for
any possible input, and is independently drawn. We also provide an alternative proof based on convex optimization, which yields a
generally tighter bound. Finally, we derive an upper bound on the mutual information in terms of adjacency events between the
action and the pair $(X, Y)$, where in this case an action $a$ and a pair $(x, y)$ are adjacent if $y = a(x)$. As an example, we apply
our bounds to the binary deletion channel and show that for the special case of an i.i.d. input distribution and a range of deletion
probabilities, our lower and upper bounds both outperform the best known bounds for the mutual information.


I. Introduction 

The mutual information $I(X;Y)$ between two jointly distributed random variables $X$ and $Y$ arises as the fundamental limit
in many information theoretic problems. When the alphabets $\mathcal{X}$ and $\mathcal{Y}$ are small, the computation of $I(X;Y)$ can be performed
directly. This is the typical scenario when considering, e.g., the calculation of the capacity of memoryless channels, assuming the
optimal input distribution is known. In many cases, however, the alphabets may become large or even grow unbounded; this
is the case, e.g., with the capacity of channels with memory that are information stable [1], where the capacity is essentially
given by the limit of $I(X^n; Y^n)/n$ for the optimal input $X^n$. In such cases, it often becomes prohibitively difficult or even
virtually impossible to compute the mutual information precisely, hence one must resort to bounding techniques.

In many problems, the marginal distributions of $X$ and $Y$ are simple and the computation of the entropies $H(X)$ and
$H(Y)$ is more tractable. In such cases the main obstacle becomes handling the joint distribution and computing the joint (or
conditional) entropy. One such prominent example is the binary deletion channel [2] with deletion probability $d$ and an i.i.d.
uniform input process. For this setting, the normalized output entropy is easy to derive and approaches $(1 - d)$. However, to
evaluate the joint distribution for any given input-output pair, one needs to find the number of different ways the output can be
obtained from the input by deleting input bits. This is a difficult combinatorial question, and consequently computing the joint
entropy is very challenging. A simpler combinatorial question is to determine whether the output can be obtained from the
input by some deletion pattern. More generally put, instead of fully characterizing the joint distribution, it is sometimes much
easier to characterize its support. Thus, the goal of this work is to provide bounds on the mutual information as a function of
the marginals and the joint support. These bounds will be useful when the support is sparse.

In what follows, we assume the alphabets $\mathcal{X}, \mathcal{Y}$ are finite unless otherwise stated. We say that $x$ and $y$ are adjacent if
$P_{XY}(x,y) > 0$, and we denote this relation by $x \sim y$. We call the event indicated by $\mathbb{1}(x \sim y)$ an adjacency event. Our first main result
is the following.

Theorem 1: For any jointly distributed discrete r.vs $(X, Y)$,

$$I(X;Y) \geq -\mathbb{E}_Y \log \mathbb{E}_X \mathbb{1}(X \sim Y) - \mathbb{E}_X \log \mathbb{E}_Y \frac{\mathbb{1}(X \sim Y)}{\mathbb{E}_X \mathbb{1}(X \sim Y)}. \tag{1}$$

Here each expectation is taken with respect to the corresponding marginal, e.g., $\mathbb{E}_X \mathbb{1}(X \sim y) = \sum_x P_X(x) \mathbb{1}(x \sim y)$.

Note that by Jensen's inequality both summands are non-negative, and therefore as a corollary we also get that $I(X;Y) \geq
-\mathbb{E}_Y \log \mathbb{E}_X \mathbb{1}(X \sim Y)$. One can find examples where both bounds are tight, e.g., for the mutual information between input
and output of the binary erasure channel. It is instructive to note that the weaker bound can be derived directly by the following
argument. Draw an i.i.d. codebook with block length $n$ according to $P_X$, and use it to communicate over a memoryless channel
$P_{Y|X}$. Consider the following decoding rule: If the output sequence $y^n$ is $P_Y$-typical and there is a unique codeword $x^n$ such
that $x_k \sim y_k$ for all $k$, output that codeword. Otherwise, declare an error. Clearly, $\Pr(X_k \sim y_k) = \mathbb{E}_X \mathbb{1}(X \sim y_k)$ and
thus, assuming that $y^n$ is typical, the probability (averaged over random codebooks) that a specific codeword will satisfy the
decoding rule is $\prod_{k=1}^{n} \mathbb{E}_X \mathbb{1}(X \sim y_k) \approx 2^{n \mathbb{E}_Y \log \mathbb{E}_X \mathbb{1}(X \sim Y)}$. Therefore, by the union bound, any rate below $-\mathbb{E}_Y \log \mathbb{E}_X \mathbb{1}(X \sim Y)$ can
be attained by this strategy with vanishing error probability, and this in turn cannot be larger than the mutual information. A
bound of this type was implicitly used in [3], [4]. Our main contribution is therefore the second term in (1). As we shall see
in Section V, this additional term can be significant.
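
As a numerical illustration of the bound in (1), the following minimal sketch evaluates both terms for a joint distribution given as a table. It is not taken from the paper: the function names (`marginals`, `mutual_information`, `adjacency_bounds`) and the list-of-rows representation are assumptions made for this example, and logarithms are taken to base 2.

```python
import math

def marginals(joint):
    """Return (P_X, P_Y) for a joint pmf stored as rows joint[x][y]."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return px, py

def mutual_information(joint):
    """Exact I(X;Y) in bits, for comparison with the bounds."""
    px, py = marginals(joint)
    return sum(p * math.log2(p / (px[x] * py[y]))
               for x, row in enumerate(joint)
               for y, p in enumerate(row) if p > 0)

def adjacency_bounds(joint):
    """Return (weak, full): the first term of (1), and the whole right-hand side of (1)."""
    px, py = marginals(joint)
    nx, ny = len(px), len(py)
    adj = [[p > 0 for p in row] for row in joint]      # adj[x][y] = 1(x ~ y)
    # t[y] = E_X 1(X ~ y), with X drawn from the marginal P_X
    t = [sum(px[x] for x in range(nx) if adj[x][y]) for y in range(ny)]
    weak = -sum(py[y] * math.log2(t[y]) for y in range(ny) if py[y] > 0)
    second = 0.0
    for x in range(nx):
        if px[x] > 0:
            inner = sum(py[y] / t[y] for y in range(ny) if adj[x][y])
            second -= px[x] * math.log2(inner)
    return weak, weak + second

# Binary erasure channel with Bern(p) input; output order is (0, 1, erasure).
p, eps = 0.3, 0.2
bec = [[(1 - p) * (1 - eps), 0.0, (1 - p) * eps],
       [0.0, p * (1 - eps), p * eps]]
print(adjacency_bounds(bec), mutual_information(bec))
```

Only the marginals and the support of the table enter `adjacency_bounds`; the exact probabilities are used solely by `mutual_information` for comparison.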

Let us briefly provide the main ideas behind our approach. A channel is traditionally defined via a conditional probability
distribution $P_{Y|X}$ of the output given the input. Alternatively, a channel can also be (nonuniquely) defined as a random mapping
$Y = A(X)$ from an input alphabet to an output alphabet, where the actual mapping applied to the input, namely the channel
action $A$, is drawn according to some probability distribution $P_A$ over the set of all possible actions, independently of the
input (see the functional representation lemma in [5, Appendix B]). Following this paradigm, the mutual information for a
given input distribution $P_X$ can be written as

$$\begin{aligned}
I(X;Y) &= H(Y) - H(Y|X) \\
&= H(Y) - \big(H(Y, A|X) - H(A|X, Y)\big) \\
&= H(Y) - H(A|X) - H(Y|A, X) + H(A|X, Y) \\
&= H(Y) - H(A) + H(A|X,Y),
\end{aligned} \tag{2}$$

where (2) follows since the action $A$ is statistically independent of the input $X$, and $Y = A(X)$. This holds for any eligible
choice of action $A$. A natural quantity to consider is therefore the intrinsic uncertainty $H(A|X,Y)$ associated with $A$, which
captures the amount of information regarding the channel action that is not revealed by observing its input and output. Note that for any
eligible choice of $A$, the quantity $I(A; X, Y) = H(A) - H(A|X, Y) = H(Y|X)$ is fixed, but the entropy of the action $H(A)$
and the intrinsic uncertainty associated with the action can vary.

As an example, consider the binary symmetric channel (BSC) with crossover probability $0 < p < 1/2$. A natural choice
for the action $A$ is drawing a r.v. $Z \sim \mathrm{Bern}(p)$ and setting $A(X) = X \oplus Z$. In this case, the entropy of the action is
$H(A) = h(p)$, where $h(\cdot)$ is the binary entropy function, and the intrinsic uncertainty is $H(A|X,Y) = 0$, since viewing $X$
and $Y$ completely reveals the action (the noise $Z$). Another possible choice for the action $A$ is drawing a ternary r.v. $U$ with
$\Pr(U = 0) = \Pr(U = 1) = p$ and $\Pr(U = 2) = 1 - 2p$, and setting

$$A(X) = U \cdot \mathbb{1}(U \neq 2) + X \cdot \mathbb{1}(U = 2).$$

In this case, the entropy of the action is $H(A) = h(2p) + 2p$, and the intrinsic uncertainty is $H(A|X, Y) = (1-p) \cdot h\left(\frac{p}{1-p}\right) > 0$,
since if $X = Y$ there remains some uncertainty regarding the action. Indeed, it can be directly verified that the identity
$h(p) = h(2p) + 2p - (1-p) \cdot h\left(\frac{p}{1-p}\right)$ holds.
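
The entropy bookkeeping for the ternary action can be checked numerically. The snippet below is a small sanity check written for this purpose (the helper names `h` and `bsc_ternary_action` are hypothetical); it assumes $0 < p < 1/2$ and uses base-2 logarithms.

```python
import math

def h(q):
    """Binary entropy function in bits."""
    if q in (0.0, 1.0):
        return 0.0
    return -q * math.log2(q) - (1 - q) * math.log2(1 - q)

def bsc_ternary_action(p):
    """Return (H(A), H(A|X,Y)) for the ternary action U on the BSC(p)."""
    # U takes the values 0 and 1 with probability p each, and 2 with probability 1 - 2p.
    h_a = -2 * p * math.log2(p) - (1 - 2 * p) * math.log2(1 - 2 * p)
    # If Y != X the action is revealed (U = Y). If Y == X, which happens with
    # probability 1 - p, U equals Y or 2 with probabilities p/(1-p) and (1-2p)/(1-p).
    h_a_given_xy = (1 - p) * h(p / (1 - p))
    return h_a, h_a_given_xy

for p in (0.05, 0.1, 0.25, 0.4):
    h_a, h_cond = bsc_ternary_action(p)
    # H(A) - H(A|X,Y) should equal H(Y|X) = h(p) for every eligible action.
    print(p, h_a - h_cond, h(p))
```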

Following the above, in Section II we derive a lower bound on the intrinsic uncertainty for any given choice of the action
$A$. This bound is based on an application of the Donsker-Varadhan variational principle. This will immediately translate into
lower bounds on the mutual information. Our general statement, given in Theorem 3, is a family of bounds that depend on the
particular choice of the action. While these bounds may be generally difficult to evaluate, we show in Section III that for any
channel $P_{Y|X}$ there always exists a specific choice of action, such that the associated bound depends only on the marginals
and the joint support. This yields Theorem 1.

The proof of Theorem 1 as delineated above is based on information theoretic arguments. Alternatively, the theorem can
also be proved more directly using convex optimization techniques. In fact, this alternative approach not only recovers
Theorem 1 but can also yield an increasing sequence of bounds that converges to the best possible lower bound on the
mutual information in terms of the marginals $P_X, P_Y$ and the support of $P_{XY}$. Furthermore, while the information theoretic
proof applies only to finite alphabets, the convex optimization approach can also handle countably infinite alphabets. This
result appears in Theorem 4, Section IV. We note that the improved bounds obtained by this procedure seem quite difficult to
evaluate in general.

Interestingly, while actions were introduced in order to lower bound the mutual information, our results can be trivially
leveraged to obtain upper bounds as well.

Theorem 2: Let $(X,Y) \sim P_X \times P_{Y|X}$ be jointly distributed discrete r.vs. Let $A$ be any action consistent with $P_{Y|X}$, i.e.,
such that $Y = A(X)$. Then

$$I(X;Y) \leq H(Y) + \mathbb{E}_A \log \mathbb{E}_{X,Y} \mathbb{1}(A \sim (X, Y)) + \mathbb{E}_{X,Y} \log \mathbb{E}_A \frac{\mathbb{1}(A \sim (X, Y))}{\mathbb{E}_{X,Y} \mathbb{1}(A \sim (X, Y))}. \tag{3}$$

Proof: By (2) we have that $I(X;Y) = H(Y) - I(A; X, Y)$. The proof follows by applying Theorem 1 to $I(A; X, Y)$. ■

Note that $\mathbb{1}(A \sim (X, Y))$ is an indicator on the event where $A(X) = Y$. In (3), the expectations are taken with respect to
$(X, Y) \sim P_{XY}$ and $A \sim P_A$ independent of $(X, Y)$. Observe also that both the second and third terms in (3) are non-positive,
hence the bound holds even if one of them is removed.

Lastly, in Section V we illustrate the applicability of our bounds in several specific examples. In particular, we provide simple
examples showing that our bounds are sometimes tight, and demonstrating that the second term in (1) can be significant. We
then consider the binary deletion channel, for which the value of the mutual information is currently unknown for any nontrivial
input process. For an i.i.d. uniform input, we evaluate our lower and upper bounds, and show that they both outperform the
best known bounds on the mutual information. Finally, we draw a relation between the upper bound from Theorem 2 and a
recent conjecture of Courtade and Kumar [6]. As all examples we consider in this paper involve binary channels, unless stated
otherwise, all logarithms are taken to base 2.

II. A Family of Bounds via Actions

In this section we define a channel by its action on its input, and develop general lower bounds on the mutual information 
between the input and output in terms of the channel action, by bounding the associated intrinsic uncertainty defined below. 

A. Channels via Actions 

Let $\mathcal{X}, \mathcal{Y}$ be discrete alphabets. Any channel $P_{Y|X}$ from $\mathcal{X}$ to $\mathcal{Y}$ can be (nonuniquely) defined by a probability distribution
$P_A$ on a set $\mathcal{A}$ of mappings from $\mathcal{X}$ to $\mathcal{Y}$, which we refer to below as actions. Each action $a(\cdot) \in \mathcal{A}$ is defined for all
possible inputs, and the channel action is chosen independently of the input, yielding the output $Y = A(X) \in \mathcal{Y}$.

For any eligible choice of action $A$, the intrinsic uncertainty of the channel with respect to the input distribution $P_X$ is defined
to be $H(A|X, Y)$. Note that while the intrinsic uncertainty may depend on the choice of $A$, the difference $H(A) - H(A|X, Y)$,
which was shown in Section I to be equal to $H(Y|X)$, does not; we therefore have the freedom to choose the action distribution
that is most convenient to work with.

Example 1 (Generic action set): For any channel $P_{Y|X}$ we can always generate the action according to the following
procedure. Let $\mathcal{A}$ consist of all functions from $\mathcal{X}$ to $\mathcal{Y}$, and for any $a \in \mathcal{A}$ set $P_A(a) = \prod_{x \in \mathcal{X}} P_{Y|X}(a(x)|x)$.

Drawing $A$ according to $P_A$, statistically independent of $X$, and setting $Y = A(X)$, is equivalent to drawing in advance a
sequence of statistically independent r.vs $\{Y_x\}_{x \in \mathcal{X}}$, where $Y_x \sim P_{Y|X}(\cdot|x)$, and then, when $X$ is revealed, outputting only the
corresponding $Y_x$. Thus, the above $\mathcal{A}$ and $P_A$ are consistent with $P_{Y|X}$, i.e., they describe the channel $P_{Y|X}$.
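
The generic construction of Example 1 can be sketched in a few lines. The code below is an illustration only: the names `generic_action_set` and `induced_channel`, and the representation of an action as a tuple of outputs indexed by the input, are assumptions made for this example.

```python
import itertools

def generic_action_set(channel):
    """channel[x][y] = P(y|x). Return a list of (action, probability) pairs,
    where action[x] is the output that the action assigns to input x."""
    n_out = len(channel[0])
    actions = []
    for a in itertools.product(range(n_out), repeat=len(channel)):
        prob = 1.0
        for x, y in enumerate(a):
            prob *= channel[x][y]          # P_A(a) = prod_x P(a(x)|x)
        if prob > 0:
            actions.append((a, prob))
    return actions

def induced_channel(actions, n_in, n_out):
    """Recover P(y|x) = sum of P_A(a) over the actions with a(x) = y."""
    out = [[0.0] * n_out for _ in range(n_in)]
    for a, prob in actions:
        for x in range(n_in):
            out[x][a[x]] += prob
    return out

bsc = [[0.9, 0.1], [0.1, 0.9]]
acts = generic_action_set(bsc)
print(acts)
print(induced_channel(acts, 2, 2))
```

For the BSC this produces four actions, whereas the natural action $A(X) = X \oplus Z$ needs only two, which illustrates that the representation is not unique.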

We further note that it is always possible to construct an action set with fewer than $|\mathcal{X}| \cdot |\mathcal{Y}|$ actions, see the functional
representation lemma in [5, Appendix B]. Moreover, in many cases there exist “natural” choices of an action that describes
the channel. In Section I we described such choices for the BSC. Below we provide a few more examples.

Example 2 (Z Channel): The (symmetric) Z channel has a binary input $X$ and binary output $Y$, such that $\Pr(Y = 0|X = 0) = 1$
and $\Pr(Y = 0|X = 1) = \Pr(Y = 1|X = 1) = \frac{1}{2}$. A natural choice for the action $A$ is taking the action set $\mathcal{A}$ to
consist of the two actions $a_1(x) = x$ and $a_2(x) = 0$ with probability assignment $p(a_1) = p(a_2) = \frac{1}{2}$.

Example 3 (Deletion Channel): In a deletion channel, each transmitted symbol is either deleted or received uncorrupted.
Assuming the input to the channel is an $n$-dimensional vector $X$, the set $\mathcal{A}$ includes $2^n$ actions, each corresponding to a
different subset of the input indices $[1 : n]$ marked for deletion. In an i.i.d. deletion model symbols are independently deleted
with probability $d$. Therefore the probability of an action $a$ that deletes exactly $w$ bits is $P(a) = d^w (1-d)^{n-w}$. Different
actions applied to the same input may result in the same output. For example, if $x = 01100$ we may get the output $y = 110$ if
either the first and fourth symbols or the first and fifth symbols were deleted. Therefore, the intrinsic uncertainty $H(A|X, Y)$
is generally positive.
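
The example above can be reproduced by brute force. The following sketch (with the hypothetical helper names `deletion_patterns` and `action_probability`) enumerates the deletion patterns that map a given input string to a given output string; it is exponential in the block length and meant only for small illustrations.

```python
from itertools import combinations

def deletion_patterns(x, y):
    """All sets of 1-based input positions whose deletion turns x into y."""
    n, w = len(x), len(x) - len(y)
    if w < 0:
        return []
    patterns = []
    for deleted in combinations(range(n), w):
        kept = "".join(x[i] for i in range(n) if i not in deleted)
        if kept == y:
            patterns.append(tuple(i + 1 for i in deleted))
    return patterns

def action_probability(n, w, d):
    """Probability of one specific pattern deleting w out of n symbols (i.i.d. deletions)."""
    return d ** w * (1 - d) ** (n - w)

print(deletion_patterns("01100", "110"))   # [(1, 4), (1, 5)]
```

Checking whether the returned list is non-empty answers the adjacency question $x \sim y$, while its length is the combinatorial quantity needed for the exact joint distribution.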

Example 4 (Trapdoor Channel): The trapdoor channel is a simple finite-state binary channel, defined as follows. Balls 
labeled “0” or “1” are used to communicate through the channel. The channel starts with a ball already in it, referred to as 
the initial state. On each channel use, a ball is inserted into the channel by the transmitter, and one of the two balls in the 
channel is emitted with equal probability. The ball that is not emitted remains inside for the next channel use. In this model, 
the channel’s action consists of choosing the initial state and deciding for each channel use whether to emit the ball that was 
already inside the channel or the ball that has just entered. Since an input x can be mapped to an output y via multiple actions, 
the intrinsic uncertainty is generally positive. 
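
To make the trapdoor action concrete, the sketch below enumerates all actions (an initial ball plus one emit decision per channel use) and lists those consistent with a given input-output pair. The function names and the encoding of the emit decision as a bit are assumptions made for this illustration.

```python
from itertools import product

def trapdoor_output(x, initial_state, choices):
    """Apply one action. choices[i] = 1 emits the ball just inserted,
    choices[i] = 0 emits the ball that was already inside."""
    state, y = initial_state, []
    for ball, emit_new in zip(x, choices):
        if emit_new:
            y.append(ball)        # the new ball leaves, the old one stays inside
        else:
            y.append(state)       # the old ball leaves, the new one stays inside
            state = ball
    return tuple(y)

def consistent_actions(x, y):
    """All (initial_state, choices) pairs that map the input x to the output y."""
    return [(s, c)
            for s in (0, 1)
            for c in product((0, 1), repeat=len(x))
            if trapdoor_output(x, s, c) == tuple(y)]

print(consistent_actions((0, 1), (0, 1)))
```

Several distinct actions typically appear in the list, which is the reason the intrinsic uncertainty is positive for this channel.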


B. Bounds 

Our main tool in lower bounding the intrinsic uncertainty is the variational formula of Donsker and Varadhan (see, e.g., [7,
Chapter 1.4]). We write $D(P\|Q)$ for the relative entropy between the distributions $P, Q$, and $Q \ll P$ if $P(x) = 0$ implies
$Q(x) = 0$.

Lemma 1 (Donsker-Varadhan): For any distribution $P$ and any nonnegative function $f(x)$ for which $\mathbb{E}_P \log f(X)$ is finite,

$$\mathbb{E}_P \log f(X) = \min_{Q \ll P} \log \mathbb{E}_Q f(X) + D(P\|Q), \tag{4}$$

and the minimum is uniquely attained by

$$Q^*(x) = \frac{P(x)/f(x)}{\mathbb{E}_P(1/f(X))}, \tag{5}$$

where by convention we set $1/f(x) = 0$ if $f(x) = 0$.

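For completeness, we include the proof of this lemma.

Proof: Let $Q^*(x)$ be as above. For any distribution $Q$ we have

$$\begin{aligned}
D(P\|Q) + \log \mathbb{E}_Q f(X) &= \mathbb{E}_P \log \frac{P(X)}{Q(X)} + \log \mathbb{E}_Q f(X) \\
&= \mathbb{E}_P \log \frac{Q^*(X)}{Q(X)} + \mathbb{E}_P \log \frac{P(X)}{Q^*(X)} + \log \mathbb{E}_Q f(X) \\
&= \sum_x P(x) \log \frac{P(x)}{Q(x) f(x) \mathbb{E}_P(1/f(X))} + \mathbb{E}_P \log \frac{P(X) f(X) \mathbb{E}_P(1/f(X))}{P(X)} + \log \mathbb{E}_Q f(X) \\
&\overset{(a)}{\geq} \Big(\sum_x P(x)\Big) \log \frac{\sum_x P(x)}{\sum_x Q(x) f(x) \mathbb{E}_P(1/f(X))} + \mathbb{E}_P \log f(X) + \log \mathbb{E}_P \frac{1}{f(X)} + \log \mathbb{E}_Q f(X) \\
&= \mathbb{E}_P \log f(X),
\end{aligned}$$

where (a) follows from the log-sum inequality [8, Chapter 2.7], which is tight if and only if $Q(x) = Q^*(x)$. ■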
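
Lemma 1 is easy to probe numerically on a small alphabet. The following check is a sketch written for illustration (the function name `dv_objective` and the specific numbers are arbitrary choices): it verifies that $Q^*$ attains $\mathbb{E}_P \log f(X)$ and that randomly drawn distributions $Q$ never do better. Natural logarithms are used.

```python
import math
import random

def dv_objective(p, q, f):
    """log E_Q f(X) + D(P||Q); assumes q[i] > 0 wherever p[i] > 0."""
    log_eq_f = math.log(sum(qi * fi for qi, fi in zip(q, f)))
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return log_eq_f + kl

p = [0.5, 0.3, 0.2]
f = [1.0, 4.0, 0.5]
target = sum(pi * math.log(fi) for pi, fi in zip(p, f))   # E_P log f(X)

# The minimizer from the lemma: Q*(x) proportional to P(x) / f(x).
norm = sum(pi / fi for pi, fi in zip(p, f))
q_star = [pi / fi / norm for pi, fi in zip(p, f)]
print(target, dv_objective(p, q_star, f))

# Random strictly positive distributions should give values no smaller than the target.
rng = random.Random(0)
worst_gap = float("inf")
for _ in range(1000):
    w = [rng.random() + 1e-9 for _ in p]
    q = [wi / sum(w) for wi in w]
    worst_gap = min(worst_gap, dv_objective(p, q, f) - target)
print(worst_gap)
```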

We would like to obtain an alternative expression for

$$H(A|X,Y) = \mathbb{E} \log \frac{1}{P(A|X,Y)}, \tag{6}$$

where the expectation is taken with respect to the joint distribution

$$P(x,y,a) = P(x)P(a|x)P(y|x,a) = P(x)P(a)\mathbb{1}(y = a(x)),$$

and $\mathbb{1}(B)$ is an indicator function for the event $B$. For brevity, we sometimes refer to this distribution as $P$.
Define the distribution

$$Q(x,y,a) \triangleq \frac{P(x,y,a)P(a|x,y)}{\mathbb{E}_P P(A|X,Y)}, \tag{7}$$

which we sometimes refer to as $Q$. Using the Donsker-Varadhan variational principle with $f(x,y,a) = 1/P(a|x,y)$, the
expectation from (6) can be written as

$$\begin{aligned}
\mathbb{E} \log \frac{1}{P(A|X,Y)} &= \log \mathbb{E}_Q \frac{1}{P(A|X,Y)} + D(P\|Q) \\
&= \log \mathbb{E}_Q \frac{1}{P(A|X,Y)} + D(P_Y\|Q_Y) + D(P_{X,A|Y}\|Q_{X,A|Y} \,|\, P_Y),
\end{aligned} \tag{8}$$

where (8) follows from the chain rule of relative entropy. The marginal distribution $Q(y)$ is given by

$$\begin{aligned}
Q(y) &= \sum_{x,a} Q(x,y,a) \\
&= \frac{1}{\mathbb{E}_P P(A|X,Y)} \sum_{x,a} P(x)P(a)\mathbb{1}(y = a(x))P(a|x,y) \\
&= \frac{\mathbb{E}_{X,A} P(A|X,y)}{\mathbb{E}_P P(A|X,Y)},
\end{aligned} \tag{9}$$

where in (9) we have used the fact that $P(a|x,y) = 0$ whenever $y \neq a(x)$, and $\mathbb{E}_{X,A}$ denotes expectation with respect to independent $X \sim P_X$ and $A \sim P_A$. Thus,

$$D(P_Y\|Q_Y) = -H(Y) + \log \mathbb{E}_P P(A|X,Y) - \mathbb{E}_Y \log \mathbb{E}_{X,A} P(A|X,Y). \tag{10}$$


In addition,

$$\log \mathbb{E}_Q \frac{1}{P(A|X,Y)} = \log \sum_{x,y,a} \frac{Q(x,y,a)}{P(a|x,y)} = -\log \mathbb{E}_P P(A|X,Y). \tag{11}$$

Substituting (10) and (11) into (8) yields

$$H(A|X,Y) = -H(Y) - \mathbb{E}_Y \log \mathbb{E}_{X,A} P(A|X,Y) + D(P_{X,A|Y}\|Q_{X,A|Y} \,|\, P_Y). \tag{12}$$


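We are left with the task of evaluating the conditional relative entropy in (12). The conditional distributions that participate in
this term are given by

$$P(x,a|y) = P(x)P(a) \frac{\mathbb{1}(y = a(x))}{\mathbb{E}_{X,A} \mathbb{1}(y = A(X))}, \tag{13}$$

$$Q(x,a|y) = P(x)P(a) \frac{P(a|x,y)}{\mathbb{E}_{X,A} P(A|X,y)}, \tag{14}$$

and therefore

$$D(P_{X,A|Y}\|Q_{X,A|Y} \,|\, P_Y) = \mathbb{E}_P \log \left( \frac{\mathbb{1}(Y = A(X))}{\mathbb{E}_{X,A} \mathbb{1}(Y = A(X))} \cdot \frac{\mathbb{E}_{X,A} P(A|X,Y)}{P(A|X,Y)} \right). \tag{15}$$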


Unfortunately, an exact computation of (15) involves the computation of $\mathbb{E}_P \log(1/P(A|X, Y))$, which is the exact technical
difficulty we are trying to avoid. Instead, we lower bound (15) using the convexity of relative entropy, i.e.,

$$D(P_{X,A|Y}\|Q_{X,A|Y} \,|\, P_Y) \geq D(P_{X,A}\|Q_{X,A}), \tag{16}$$

where

$$Q(x, a) = \sum_y P(y)Q(x, a|y) = P(x, a)\, \mathbb{E}_Y \frac{P(a|x,Y)}{\mathbb{E}_{X,A} P(A|X, Y)}. \tag{17}$$

Note that other properties of relative entropy, such as the data-processing inequality or Pinsker's inequality, could potentially
be useful for bounding (15). Combining (16) and (17) gives

$$D(P_{X,A|Y}\|Q_{X,A|Y} \,|\, P_Y) \geq -\mathbb{E}_{X,A} \log \mathbb{E}_Y \frac{P(A|X,Y)}{\mathbb{E}_{X,A} P(A|X,Y)}. \tag{18}$$

Substituting (18) into (12) and using (2) yields the following.


Theorem 3: Let $(X,Y) \sim P_X \times P_{Y|X}$ be jointly distributed discrete r.vs. Let $A$ be any action consistent with $P_{Y|X}$, i.e.,
such that $Y = A(X)$. Then

$$I(X;Y) \geq -H(A) - \mathbb{E}_Y \log \mathbb{E}_{X,A} P(A|X, Y) - \mathbb{E}_{X,A} \log \mathbb{E}_Y \frac{P(A|X,Y)}{\mathbb{E}_{X,A} P(A|X,Y)}. \tag{19}$$


III. A Bound via Adjacency Events 


An action $A$ is called uniform if all actions in its support $\mathcal{A}$ are equiprobable, i.e.,

$$P_A(a) = \begin{cases} \frac{1}{|\mathcal{A}|} & a \in \mathcal{A} \\ 0 & a \notin \mathcal{A}. \end{cases}$$

At this point, we restrict our attention to this class of actions, for which the bound in Theorem 3 takes a particularly simple
form that depends only on the marginal distributions of $X$ and $Y$ and their joint support. We then show that any channel can
be essentially characterized by a uniform action, which in turn proves Theorem 1.

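For any $x \in \mathcal{X}$ and $y \in \mathcal{Y}$ let

$$\mathcal{A}(x,y) = \{a \in \mathcal{A} : a(x) = y\} \tag{20}$$

be the set of all possible actions in $\mathcal{A}$ that map the input $x$ to the output $y$. Denote the cardinality of this set by $N(x, y) =
|\mathcal{A}(x,y)|$.

Proposition 1: If $A$ is a uniform action, then $A$ conditioned on $X$ and $Y$ is uniformly distributed over the set $\mathcal{A}(X,Y)$.^1

Proof:

$$\begin{aligned}
P(a|x,y) &= \frac{P(x,y|a)P(a)}{P(x,y)} \\
&= \frac{P(y|x,a)P(a)}{P(y|x)} \\
&= \frac{\mathbb{1}(y = a(x))P(a)}{\sum_{a' \in \mathcal{A}(x,y)} P(a')} \\
&\overset{(a)}{=} \frac{\mathbb{1}(a \in \mathcal{A}(x,y))}{N(x,y)},
\end{aligned} \tag{21}$$

where (a) follows from $\mathbb{1}(y = a(x)) = \mathbb{1}(a \in \mathcal{A}(x, y))$ and since $P(a) = \frac{1}{|\mathcal{A}|}$ for all $a \in \mathcal{A}$. ■

^1 Note that the converse is not generally true. As a counterexample, consider the BSC with the action $A(X) = X \oplus Z$ where $Z \sim \mathrm{Bern}(p)$.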










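Lemma 2: Suppose $P_{Y|X}$ can be represented by a uniform action $A$. Then, for any input distribution $P_X$,

$$-\mathbb{E}_Y \log \mathbb{E}_{X,A} P(A|X, Y) = H(A) - \mathbb{E}_Y \log \mathbb{E}_X \mathbb{1}(X \sim Y). \tag{22}$$

Proof: Using Proposition 1,

$$\begin{aligned}
\mathbb{E}_{X,A} P(A|X,y) &= \sum_x P(x) \sum_{a \in \mathcal{A}} \frac{1}{|\mathcal{A}|} \frac{\mathbb{1}(a \in \mathcal{A}(x,y))}{N(x,y)} \\
&= \frac{1}{|\mathcal{A}|} \sum_x P(x) \frac{N(x,y)}{N(x,y)} \mathbb{1}(x \sim y) \\
&= \frac{1}{|\mathcal{A}|} \mathbb{E}_X \mathbb{1}(X \sim y).
\end{aligned} \tag{23}$$

Thus,

$$\begin{aligned}
-\mathbb{E}_Y \log \mathbb{E}_{X,A} P(A|X, Y) &= -\mathbb{E}_Y \log \frac{1}{|\mathcal{A}|} - \mathbb{E}_Y \log \mathbb{E}_X \mathbb{1}(X \sim Y) \\
&= \log |\mathcal{A}| - \mathbb{E}_Y \log \mathbb{E}_X \mathbb{1}(X \sim Y).
\end{aligned}$$

The lemma follows since $H(A) = \log |\mathcal{A}|$ by the assumption that $A$ is a uniform action. ■

The next lemma lower bounds the last term in (19) for channels with a uniform action $A$.

Lemma 3: Suppose $P_{Y|X}$ can be represented by a uniform action $A$. Then, for any input distribution $P_X$,

$$-\mathbb{E}_{X,A} \log \mathbb{E}_Y \frac{P(A|X,Y)}{\mathbb{E}_{X,A} P(A|X, Y)} \geq -\mathbb{E}_X \log \mathbb{E}_Y \frac{\mathbb{1}(X \sim Y)}{\mathbb{E}_X \mathbb{1}(X \sim Y)} \geq 0. \tag{24}$$

Proof: By virtue of Jensen's inequality,

$$-\mathbb{E}_{X,A} \log \mathbb{E}_Y \frac{P(A|X,Y)}{\mathbb{E}_{X,A} P(A|X, Y)} \geq -\mathbb{E}_X \log \mathbb{E}_Y \frac{\mathbb{E}_A P(A|X,Y)}{\mathbb{E}_{X,A} P(A|X,Y)}.$$

Using (21) and (23), we have

$$\frac{\mathbb{E}_A P(A|x,y)}{\mathbb{E}_{X,A} P(A|X,y)} = \frac{\mathbb{E}_A \frac{\mathbb{1}(A \in \mathcal{A}(x,y))}{N(x,y)}}{\frac{1}{|\mathcal{A}|} \mathbb{E}_X \mathbb{1}(X \sim y)} = \frac{\frac{1}{|\mathcal{A}|} \mathbb{1}(x \sim y)}{\frac{1}{|\mathcal{A}|} \mathbb{E}_X \mathbb{1}(X \sim y)} = \frac{\mathbb{1}(x \sim y)}{\mathbb{E}_X \mathbb{1}(X \sim y)},$$

establishing the first inequality in (24). The second inequality follows by applying Jensen's inequality again, this time w.r.t.
$\mathbb{E}_X$. ■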

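Combining Theorem 3, Lemma 2 and Lemma 3 establishes the following.

Lemma 4: Suppose $P_{Y|X}$ can be represented by a uniform action $A$. Then, for any input distribution $P_X$,

$$I(X;Y) \geq -\mathbb{E}_Y \log \mathbb{E}_X \mathbb{1}(X \sim Y) - \mathbb{E}_X \log \mathbb{E}_Y \frac{\mathbb{1}(X \sim Y)}{\mathbb{E}_X \mathbb{1}(X \sim Y)}. \tag{25}$$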


To establish our main result for any channel and input distribution, we first show the following.

Lemma 5: Let $P_{Y|X}$ be a channel with the property that $P(y|x)$ is rational for all $x$ and $y$. Then there exists a uniform
action for $P_{Y|X}$.

Proof: For any channel $P_{Y|X}$ with rational probabilities there exists some action set $\mathcal{A} = \{a_1, \ldots, a_{|\mathcal{A}|}\}$ and a corresponding
probability distribution $P_A$ consistent with it such that all probabilities $P_A(a_i)$, $i = 1, \ldots, |\mathcal{A}|$, are positive rational
numbers. For example, the construction from Example 1 yields rational probabilities $P_A(a_i)$, $i = 1, \ldots, |\mathcal{A}|$. We construct a
new action $\tilde{A}$ by duplicating each action $a_i$ to $M_i$ identical actions, and assigning the probability $P_A(a_i)/M_i$ to each of them.
Clearly, the new action is also consistent with $P_{Y|X}$ for any choice of the natural numbers $M_1, \ldots, M_{|\mathcal{A}|}$. By our assumption
that all original action probabilities are positive rational numbers, we can always find a choice of $M_1, \ldots, M_{|\mathcal{A}|}$ such that all
new action probabilities are equal. For such a choice the action $\tilde{A}$ will be uniform. ■
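
The duplication argument in the proof of Lemma 5 can be made explicit with exact rational arithmetic. The sketch below is one possible realization, not the paper's: the function name `uniformize` and the choice of the least common multiple of the denominators as the common scale are assumptions made for this example (`math.lcm` requires Python 3.9 or later).

```python
from fractions import Fraction
from math import lcm

def uniformize(actions):
    """actions: list of (label, Fraction probability) with positive rational probabilities.
    Return a list of labels with repetitions; the uniform distribution on this list
    assigns total probability P_A(a_i) to the copies of a_i."""
    common = lcm(*(prob.denominator for _, prob in actions))
    uniform = []
    for label, prob in actions:
        copies = prob * common              # M_i, so that P_A(a_i) / M_i = 1 / common
        assert copies.denominator == 1
        uniform.extend([label] * int(copies))
    return uniform

acts = [("a1", Fraction(1, 3)), ("a2", Fraction(1, 2)), ("a3", Fraction(1, 6))]
print(uniformize(acts))
```

Here each duplicated action carries probability $1/6$, so the new action is uniform while the induced channel is unchanged.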

Using Lemma 4 and Lemma 5 we can now prove our main result.

Proof of Theorem 1: Any channel $P_{Y|X}$ can be approximated arbitrarily well by a conditional distribution $\tilde{P}_{Y|X}$ with
the same support whose entries are all rational, in the sense that $\max_{x,y} |P_{Y|X}(y|x) - \tilde{P}_{Y|X}(y|x)|$ can be made arbitrarily
small. This means that both $P_X \times \tilde{P}_{Y|X}$ and the corresponding marginal $\tilde{P}_Y$ are arbitrarily close to $P_{XY}$ and $P_Y$, respectively.
Since the mutual information $I(X;Y)$ is continuous with respect to $P_{XY}$, the mutual information $I(X;\tilde{Y})$ between $X$ and the
output of the “rational” channel $\tilde{P}_{Y|X}$ can be made arbitrarily close to $I(X;Y)$. By Lemma 5 there exists a uniform action
for $\tilde{P}_{Y|X}$, and consequently by Lemma 4 its mutual information is lower bounded by (25). By continuity, $I(X;Y)$ is also
lower bounded by (25). ■


IV. A Convex-Optimization Based Bound 

In the previous section we have proved a lower bound on $I(X;Y)$ that depends only on the marginal distributions $P_X, P_Y$
and the support of the joint distribution, namely, the function $\mathbb{1}(x \sim y)$. Our proof relied on information theoretic arguments.
In this section we will take a more direct approach to the problem, and derive bounds on $I(X;Y)$ in terms of the same
quantities, using convex optimization. More specifically, to arrive at a lower bound we minimize $I(X;Y)$ w.r.t. $P_{XY}$ subject
to the constraints that the marginal distributions are $P_X, P_Y$, and that $P_{XY}(x,y) = 0$ whenever $\mathbb{1}(x \sim y) = 0$. Throughout
this section we assume all logarithms are natural, while the result of Theorem 4 remains valid as long as the same
logarithmic base is applied to $I(X;Y)$.

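We consider the following problem:

$$\begin{aligned}
\text{minimize} \quad & I(X;Y) \\
\text{subject to:} \quad & \sum_{x: x \sim y} P_{XY}(x,y) = P_Y(y) \quad \forall y \in \mathcal{Y} \\
& \sum_{y: y \sim x} P_{XY}(x,y) = P_X(x) \quad \forall x \in \mathcal{X} \\
& P_{XY}(x,y) \geq 0 \text{ if } x \sim y, \quad P_{XY}(x,y) = 0 \text{ if } x \nsim y.
\end{aligned}$$

Note that the constraints above imply $\sum_{x \sim y} P_{XY}(x, y) = 1$. This is equivalent to

$$\begin{aligned}
\text{minimize} \quad & \sum_{x \sim y} P_{XY}(x,y) \log \frac{P_{XY}(x,y)}{P_X(x)P_Y(y)} \\
\text{subject to:} \quad & \sum_{x: x \sim y} P_{XY}(x,y) = P_Y(y) \quad \forall y \in \mathcal{Y} \\
& \sum_{y: y \sim x} P_{XY}(x,y) = P_X(x) \quad \forall x \in \mathcal{X} \\
& P_{XY}(x,y) \geq 0 \quad \forall (x,y) \in \mathcal{X} \times \mathcal{Y},\ x \sim y.
\end{aligned}$$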

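This objective function is convex in $P_{XY}(x, y)$, and the constraints are linear, so the optimization solution can be obtained by
the solution to the dual problem given by

$$\begin{aligned}
L &= \inf_{P_{XY}(x,y)} \ \sup_{\lambda_x, \mu_y \in \mathbb{R},\, \tau_{xy} \geq 0} \ \sum_{x \sim y} P_{XY}(x,y) \log \frac{P_{XY}(x,y)}{P_X(x)P_Y(y)} - \sum_x \lambda_x \Big( \sum_{y: y \sim x} P_{XY}(x,y) - P_X(x) \Big) \\
&\qquad - \sum_y \mu_y \Big( \sum_{x: x \sim y} P_{XY}(x,y) - P_Y(y) \Big) - \sum_{x \sim y} \tau_{xy} P_{XY}(x,y) \\
&= \sup_{\lambda_x, \mu_y \in \mathbb{R},\, \tau_{xy} \geq 0} \ \inf_{P_{XY}(x,y)} \ \sum_{x \sim y} P_{XY}(x, y) \Big( \log \frac{P_{XY}(x,y)}{P_X(x)P_Y(y)} - \lambda_x - \mu_y - \tau_{xy} \Big) + \sum_x \lambda_x P_X(x) + \sum_y \mu_y P_Y(y) \qquad (26) \\
&= \sup_{\lambda_x, \mu_y \in \mathbb{R},\, \tau_{xy} \geq 0} \ \inf_{P_{XY}(x,y)} \ \sum_{x \sim y} \big( P_{XY}(x,y) \log P_{XY}(x,y) - a_{xy} P_{XY}(x,y) \big) + \sum_x \lambda_x P_X(x) + \sum_y \mu_y P_Y(y),
\end{aligned}$$

where (26) follows from the minimax theorem and

$$a_{xy} = \log(P_X(x)P_Y(y)) + \lambda_x + \mu_y + \tau_{xy}.$$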

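The function $f(x) = x \log x - ax$ is minimized at $x^* = e^{a-1}$ and its minimal value is $f(x^*) = -e^{a-1}$. Using this, we get that

$$\begin{aligned}
L &= \sup_{\lambda_x, \mu_y \in \mathbb{R},\, \tau_{xy} \geq 0} \ \sum_{x \sim y} -e^{\log(P_X(x)P_Y(y)) + \lambda_x + \mu_y + \tau_{xy} - 1} + \sum_x \lambda_x P_X(x) + \sum_y \mu_y P_Y(y) \\
&= \sup_{\lambda_x, \mu_y \in \mathbb{R},\, \tau_{xy} \geq 0} \ \sum_{x \sim y} -P_X(x)P_Y(y) e^{\lambda_x + \mu_y + \tau_{xy} - 1} + \sum_x \lambda_x P_X(x) + \sum_y \mu_y P_Y(y).
\end{aligned}$$

Clearly, the maximizing $\tau_{xy}$ is $\tau_{xy} = 0$, which gives

$$\begin{aligned}
L &= \sup_{\lambda_x, \mu_y \in \mathbb{R}} \ \sum_{x \sim y} -P_X(x)P_Y(y) e^{\lambda_x + \mu_y - 1} + \sum_x \lambda_x P_X(x) + \sum_y \mu_y P_Y(y) \\
&= \sup_{\lambda_x, \mu_y \in \mathbb{R}} \ \sum_{x \sim y} -P_X(x)P_Y(y) e^{\lambda_x + \mu_y} + \sum_x \lambda_x P_X(x) + \sum_y \mu_y P_Y(y) + 1,
\end{aligned}$$

where in the last step we replaced $\mu_y$ with $\mu_y - 1$ (with some abuse of notation). Let $\lambda$ and $\mu$ be the vectors holding $\{\lambda_x\}_{x \in \mathcal{X}}$
and $\{\mu_y\}_{y \in \mathcal{Y}}$, respectively, and

$$G(\lambda, \mu) = \sum_{x \sim y} -P_X(x)P_Y(y) e^{\lambda_x + \mu_y} + \sum_x \lambda_x P_X(x) + \sum_y \mu_y P_Y(y) + 1,$$

such that

$$L = \sup_{\lambda \in \mathbb{R}^{|\mathcal{X}|},\, \mu \in \mathbb{R}^{|\mathcal{Y}|}} G(\lambda, \mu).$$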

We will use the alternating minimization approach to minimize $-G(\lambda, \mu)$ (which is equivalent to maximizing $G(\lambda, \mu)$)
over $\lambda$ and $\mu$. This approach is described as follows: for an arbitrary initialization $\lambda^{(0)}$, we use an iterative algorithm to
successively minimize the target function. In the $k$-th iteration, we first hold $\lambda^{(k)}$ fixed and minimize the target function over
$\mu$ to obtain $\mu^{(k)}$, and then hold $\mu^{(k)}$ fixed and minimize the target function over $\lambda$ to obtain $\lambda^{(k+1)}$. In mathematical
form, for $k \geq 0$, we have

$$\mu^{(k)} \in \arg\min_{\mu} -G(\lambda^{(k)}, \mu),$$

$$\lambda^{(k+1)} \in \arg\min_{\lambda} -G(\lambda, \mu^{(k)}).$$

The alternating minimization approach is widely used in optimization where separate optimization over different parameter
subsets is much easier than the joint optimization, e.g., in the expectation-maximization (EM) algorithm [9] to find the maximum
likelihood estimator, in the Blahut-Arimoto algorithm [10], [11] to maximize the mutual information between channel input
and output, and in minimizing the Kullback-Leibler divergence between two convex sets of finite measures [12], to name a few.
One remarkable property of this approach is that, by definition we have

$$G(\lambda^{(0)}, \mu^{(0)}) \leq G(\lambda^{(1)}, \mu^{(0)}) \leq G(\lambda^{(1)}, \mu^{(1)}) \leq G(\lambda^{(2)}, \mu^{(1)}) \leq \cdots \leq L, \tag{27}$$

i.e., the value sequence obtained by this approach is non-decreasing and must have a limit. We remark that since $G(\lambda, \mu)$ is
jointly concave with respect to $(\lambda, \mu)$, the alternating minimization approach converges to the global optimum [13, Proposition
2.7.1], i.e.,

$$\lim_{k \to \infty} G(\lambda^{(k)}, \mu^{(k)}) = \lim_{k \to \infty} G(\lambda^{(k+1)}, \mu^{(k)}) = L.$$

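Next, we derive the expressions of $\lambda^{(k)}$ and $\mu^{(k)}$ obtained from the alternating minimization procedure. Initially we set
$\lambda_x^{(0)} = 0$, $\forall x \in \mathcal{X}$. For $k \geq 0$, by the definition of $\mu^{(k)}$ in the alternating minimization we have $\frac{\partial G(\lambda^{(k)}, \mu)}{\partial \mu_y} = 0$, $\forall y \in \mathcal{Y}$, which
gives

$$e^{-\mu_y^{(k)}} = \sum_{x: x \sim y} P_X(x) e^{\lambda_x^{(k)}}, \quad \forall y \in \mathcal{Y}. \tag{28}$$

Similarly, for $\lambda^{(k+1)}$ we have

$$e^{-\lambda_x^{(k+1)}} = \sum_{y: y \sim x} P_Y(y) e^{\mu_y^{(k)}}, \quad \forall x \in \mathcal{X}. \tag{29}$$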


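Based on (29) and (28), it is straightforward to verify that the first two iterations of this procedure yield

$$L \geq G(\lambda^{(0)}, \mu^{(0)}) = -\mathbb{E}_Y \log \mathbb{E}_X \mathbb{1}(X \sim Y),$$

$$L \geq G(\lambda^{(1)}, \mu^{(0)}) = -\mathbb{E}_Y \log \mathbb{E}_X \mathbb{1}(X \sim Y) - \mathbb{E}_X \log \mathbb{E}_Y \frac{\mathbb{1}(X \sim Y)}{\mathbb{E}_X \mathbb{1}(X \sim Y)},$$

in agreement with the bound derived in Theorem 1. Continuing with this procedure we can further improve our bound. To
characterize the bound after $k$ iterations, we introduce the functions $T_X^{(k)}(P_X(x), P_Y(y), \mathbb{1}(x \sim y))$ and $T_Y^{(k)}(P_X(x), P_Y(y), \mathbb{1}(x \sim y))$,
which are defined recursively as

$$T_X^{(0)}(P_X(x), P_Y(y), \mathbb{1}(x \sim y)) = 1, \tag{30}$$

and for $k \geq 0$,

$$T_Y^{(k)}(P_X(x), P_Y(y), \mathbb{1}(x \sim y)) = \mathbb{E}_X \frac{\mathbb{1}(X \sim y)}{T_X^{(k)}(P_X(x), P_Y(y), \mathbb{1}(x \sim y))},$$

$$T_X^{(k+1)}(P_X(x), P_Y(y), \mathbb{1}(x \sim y)) = \mathbb{E}_Y \frac{\mathbb{1}(x \sim Y)}{T_Y^{(k)}(P_X(x), P_Y(y), \mathbb{1}(x \sim y))}.$$

It can be easily verified by induction that

$$G(\lambda^{(k)}, \mu^{(k)}) = -\mathbb{E}_X \log T_X^{(k)}(P_X(x), P_Y(y), \mathbb{1}(x \sim y)) - \mathbb{E}_Y \log T_Y^{(k)}(P_X(x), P_Y(y), \mathbb{1}(x \sim y)),$$

$$G(\lambda^{(k+1)}, \mu^{(k)}) = -\mathbb{E}_X \log T_X^{(k+1)}(P_X(x), P_Y(y), \mathbb{1}(x \sim y)) - \mathbb{E}_Y \log T_Y^{(k)}(P_X(x), P_Y(y), \mathbb{1}(x \sim y)).$$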

Thus, we have arrived at the following theorem. 

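Theorem 4: For any jointly distributed discrete r.vs $(X, Y)$ and any $k \geq 0$,

$$I(X;Y) \geq -\mathbb{E}_X \log T_X^{(k+1)}(P_X(x), P_Y(y), \mathbb{1}(x \sim y)) - \mathbb{E}_Y \log T_Y^{(k)}(P_X(x), P_Y(y), \mathbb{1}(x \sim y)) \tag{31}$$

$$\geq -\mathbb{E}_X \log T_X^{(k)}(P_X(x), P_Y(y), \mathbb{1}(x \sim y)) - \mathbb{E}_Y \log T_Y^{(k)}(P_X(x), P_Y(y), \mathbb{1}(x \sim y)). \tag{32}$$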
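
The recursion behind Theorem 4 translates directly into a short iterative routine. The following is a minimal sketch under stated assumptions: the function name `theorem4_bounds` is hypothetical, the marginals are assumed strictly positive, every row and column of the adjacency table is assumed to contain at least one adjacency, and values are returned in nats.

```python
import math

def theorem4_bounds(px, py, adj, iterations):
    """Return [G(l0,m0), G(l1,m0), G(l1,m1), G(l2,m1), ...] in nats,
    where adj[x][y] plays the role of 1(x ~ y)."""
    nx, ny = len(px), len(py)
    tx = [1.0] * nx                                   # T_X^(0) = 1
    values = []
    for _ in range(iterations):
        # T_Y^(k)(y) = E_X [ 1(X ~ y) / T_X^(k)(X) ]
        ty = [sum(px[x] / tx[x] for x in range(nx) if adj[x][y]) for y in range(ny)]
        ex = -sum(px[x] * math.log(tx[x]) for x in range(nx))
        ey = -sum(py[y] * math.log(ty[y]) for y in range(ny))
        values.append(ex + ey)                        # G(lambda^(k), mu^(k))
        # T_X^(k+1)(x) = E_Y [ 1(x ~ Y) / T_Y^(k)(Y) ]
        tx = [sum(py[y] / ty[y] for y in range(ny) if adj[x][y]) for x in range(nx)]
        ex = -sum(px[x] * math.log(tx[x]) for x in range(nx))
        values.append(ex + ey)                        # G(lambda^(k+1), mu^(k))
    return values

# Z channel with Bern(p) input: Y ~ Bern(p/2), and (0, 1) is the only non-adjacent pair.
p = 0.4
vals = theorem4_bounds([1 - p, p], [1 - p / 2, p / 2], [[1, 0], [1, 1]], 3)
print([v / math.log(2) for v in vals])                # the same bounds in bits
```

The first two entries correspond to the weaker bound and to the bound of Theorem 1, and later entries are the refinements; only the marginals and the adjacency table are needed as inputs.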


V. Examples 

In this section we evaluate the bounds derived in Theorems 1 and 2, and when possible also those from Theorem 4, for
four examples. The following simple lower bound on $I(X;Y)$ will serve as our baseline for demonstrating the improvement
attained by applying the bound from Theorem 1.

Proposition 2: For any jointly distributed discrete r.vs $(X, Y)$,

$$I(X;Y) \geq -\mathbb{E}_Y \log \mathbb{E}_X \mathbb{1}(X \sim Y). \tag{33}$$

Similar to the bound from Theorem 1, the bound above is given in terms of the marginals and the joint support of $(X, Y)$.
However, it is weaker than the former bound, as it can be obtained from it directly by applying Jensen's inequality on the
second term of (1), which gives $-\mathbb{E}_X \log \mathbb{E}_Y \frac{\mathbb{1}(X \sim Y)}{\mathbb{E}_X \mathbb{1}(X \sim Y)} \geq 0$. In Section I we also gave an operational
proof of this bound.

A. Erasure Channel 

The binary erasure channel has input $X \in \{0,1\}$ and output $Y \in \{0, 1, \mathcal{E}\}$ such that $\Pr(Y = x|X = x) = 1 - \epsilon$ and
$\Pr(Y = \mathcal{E}|X = x) = \epsilon$ for any $x$. For $X \sim \mathrm{Bern}(p)$ we have $\Pr(Y = 0) = (1 - \epsilon)(1 - p)$, $\Pr(Y = 1) = (1 - \epsilon)p$ and
$\Pr(Y = \mathcal{E}) = \epsilon$, and the mutual information between the input and output is $I_p(X;Y) = (1 - \epsilon)h(p)$. For this channel $x \sim y$
if and only if either $x = y$ or $y = \mathcal{E}$, and therefore

$$\begin{aligned}
-\mathbb{E}_Y \log \mathbb{E}_X \mathbb{1}(X \sim Y) &= -(1 - \epsilon)(1 - p)\log(1 - p) - (1 - \epsilon)p\log(p) - \epsilon\log(1) \\
&= (1 - \epsilon)h(p).
\end{aligned}$$

Thus, for this channel our lower bound from Theorem 1 as well as the weaker bound from Proposition 2 are tight.

In order to evaluate our upper bound from Theorem 2 we need to choose an action $A$ consistent with $P_{Y|X}$. We take the
natural action set, which consists of two actions, $a_1(x) = x$ and $a_2(x) = \mathcal{E}$, with $p(a_1) = 1 - \epsilon$ and $p(a_2) = \epsilon$. For this choice
we have

$$\begin{aligned}
\mathbb{E}_A \log \mathbb{E}_{X,Y} \mathbb{1}(A \sim (X,Y)) &= p(a_1) \log \Pr(X = Y) + p(a_2) \log \Pr(Y = \mathcal{E}) \\
&= (1 - \epsilon) \log(1 - \epsilon) + \epsilon \log(\epsilon) \\
&= -h(\epsilon).
\end{aligned}$$

Since $H(Y) = h(\epsilon) + (1 - \epsilon)h(p)$, the upper bound from Theorem 2 is tight and gives

$$I_p(X;Y) \leq (1 - \epsilon)h(p).$$


B. Z Channel 


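The (symmetric) Z channel has a binary input $X$ and a binary output $Y$ such that $\Pr(Y = 0|X = 0) = 1$ and $\Pr(Y = 0|X = 1)
= \Pr(Y = 1|X = 1) = \frac{1}{2}$. For $X \sim \mathrm{Bern}(p)$ we have $Y \sim \mathrm{Bern}(\frac{p}{2})$, and the mutual information between the input
and output is $I_p(X;Y) = h(\frac{p}{2}) - p$. For this channel $x \sim y$ if and only if $(x, y) \neq (0,1)$, and therefore $\mathbb{E}_X \mathbb{1}(X \sim 0) = 1$ and
$\mathbb{E}_X \mathbb{1}(X \sim 1) = \Pr(X = 1) = p$. We have

$$\begin{aligned}
-\mathbb{E}_Y \log \mathbb{E}_X \mathbb{1}(X \sim Y) &= -\Pr(Y = 0) \log \mathbb{E}_X \mathbb{1}(X \sim 0) - \Pr(Y = 1) \log \mathbb{E}_X \mathbb{1}(X \sim 1) \\
&= -\frac{p}{2} \log(p),
\end{aligned} \tag{34}$$

and

$$\begin{aligned}
-\mathbb{E}_X \log \mathbb{E}_Y \frac{\mathbb{1}(X \sim Y)}{\mathbb{E}_X \mathbb{1}(X \sim Y)} &= -(1-p) \log \left( \Big(1 - \frac{p}{2}\Big) \frac{\mathbb{1}(0 \sim 0)}{1} + \frac{p}{2} \frac{\mathbb{1}(0 \sim 1)}{p} \right) - p \log \left( \Big(1 - \frac{p}{2}\Big) \frac{\mathbb{1}(1 \sim 0)}{1} + \frac{p}{2} \frac{\mathbb{1}(1 \sim 1)}{p} \right) \\
&= -(1-p) \log\Big(1 - \frac{p}{2}\Big) - p \log\Big(\Big(1 - \frac{p}{2}\Big) + \frac{1}{2}\Big) \\
&= 1 - (1-p)\log(2-p) - p\log(3-p).
\end{aligned} \tag{35}$$

Thus, Proposition 2 gives

$$I_p(X;Y) \geq -\frac{p}{2} \log(p), \tag{36}$$

and Theorem 1 gives

$$I_p(X;Y) \geq 1 - \frac{p}{2} \log(p) - (1-p)\log(2-p) - p\log(3-p). \tag{37}$$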

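For comparison, we also take a look at a further refinement given by Theorem 4. By the definitions of $T_X^{(k)}$ and $T_Y^{(k)}$, we
know that

$$T_Y^{(0)}(P_X(x), P_Y(y), \mathbb{1}(x \sim y)) = \mathbb{E}_X \mathbb{1}(X \sim Y) = \mathbb{1}(Y = 0) + p\,\mathbb{1}(Y = 1),$$

$$\begin{aligned}
T_X^{(1)}(P_X(x), P_Y(y), \mathbb{1}(x \sim y)) &= \mathbb{E}_Y \frac{\mathbb{1}(X \sim Y)}{T_Y^{(0)}(P_X(x), P_Y(y), \mathbb{1}(x \sim y))} \\
&= \mathbb{E}_Y \frac{\mathbb{1}(X \sim Y)}{\mathbb{1}(Y = 0) + p\,\mathbb{1}(Y = 1)} \\
&= \Big( \Big(1 - \frac{p}{2}\Big) + \frac{p}{2} \cdot 0 \Big) \mathbb{1}(X = 0) + \Big( \Big(1 - \frac{p}{2}\Big) + \frac{p}{2} \cdot \frac{1}{p} \Big) \mathbb{1}(X = 1) \\
&= \Big(1 - \frac{p}{2}\Big) \mathbb{1}(X = 0) + \Big(\frac{3}{2} - \frac{p}{2}\Big) \mathbb{1}(X = 1),
\end{aligned}$$

$$\begin{aligned}
T_Y^{(1)}(P_X(x), P_Y(y), \mathbb{1}(x \sim y)) &= \mathbb{E}_X \frac{\mathbb{1}(X \sim Y)}{T_X^{(1)}(P_X(x), P_Y(y), \mathbb{1}(x \sim y))} \\
&= \mathbb{E}_X \frac{\mathbb{1}(X \sim Y)}{\big(1 - \frac{p}{2}\big) \mathbb{1}(X = 0) + \big(\frac{3}{2} - \frac{p}{2}\big) \mathbb{1}(X = 1)} \\
&= \left( \frac{1-p}{1 - \frac{p}{2}} + \frac{p}{\frac{3}{2} - \frac{p}{2}} \right) \mathbb{1}(Y = 0) + \left( (1-p) \cdot 0 + \frac{p}{\frac{3}{2} - \frac{p}{2}} \right) \mathbb{1}(Y = 1) \\
&= \left( \frac{2-2p}{2-p} + \frac{2p}{3-p} \right) \mathbb{1}(Y = 0) + \frac{2p}{3-p} \mathbb{1}(Y = 1).
\end{aligned}$$

As a result, Theorem 4 gives

$$\begin{aligned}
I_p(X;Y) &\geq -\mathbb{E}_X \log T_X^{(1)}(P_X(x), P_Y(y), \mathbb{1}(x \sim y)) - \mathbb{E}_Y \log T_Y^{(1)}(P_X(x), P_Y(y), \mathbb{1}(x \sim y)) \\
&= -(1-p) \log\Big(1 - \frac{p}{2}\Big) - p \log \frac{3-p}{2} - \Big(1 - \frac{p}{2}\Big) \log \left( \frac{2-2p}{2-p} + \frac{2p}{3-p} \right) - \frac{p}{2} \log \frac{2p}{3-p} \\
&= \frac{p}{2} \log(2-p) + (1-p)\log(3-p) - \frac{p}{2}\log(p) - \Big(1 - \frac{p}{2}\Big) \log(3-2p).
\end{aligned} \tag{38}$$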


The bounds from (36), (37) and (38) are plotted in Figure 1 as a function of $p$ along with the exact value of $I_P(X;Y)$. It can be seen that the lower bound from Theorem 1 is significantly tighter than the one from Proposition 2 and it is quite close to $I_P(X;Y)$ for all values of $p$. The lower bound from Theorem 4 is even tighter.
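As a numerical sanity check (not part of the original derivation), the following Python sketch evaluates the three lower bounds directly from the joint distribution of the Z channel and compares them with the closed forms (36), (37) and (38). The function name `z_channel_bounds` and the iterates `t0`, `t1`, `t2` are illustrative names; the recursion used for them is our reading of the computation above, not a restatement of Theorem 4.

```python
import math

def h(u):
    """Binary entropy in bits."""
    return 0.0 if u in (0.0, 1.0) else -u * math.log2(u) - (1 - u) * math.log2(1 - u)

def z_channel_bounds(p):
    """Return (Prop. 2 bound, Thm. 1 bound, Thm. 4 bound, exact I) for X ~ Bern(p)."""
    joint = {(0, 0): 1 - p, (1, 0): p / 2, (1, 1): p / 2}
    px = {0: 1 - p, 1: p}
    py = {0: 1 - p / 2, 1: p / 2}
    adj = lambda x, y: 1.0 if (x, y) in joint else 0.0  # x ~ y iff P(x, y) > 0

    # t0(y) = E_X 1(X ~ y); t1(x) = E_Y[1(x ~ Y) / t0(Y)]; t2(y) = E_X[1(X ~ y) / t1(X)]
    t0 = {y: sum(px[x] * adj(x, y) for x in px) for y in py}
    t1 = {x: sum(py[y] * adj(x, y) / t0[y] for y in py) for x in px}
    t2 = {y: sum(px[x] * adj(x, y) / t1[x] for x in px) for y in py}

    e_y = lambda t: sum(py[y] * math.log2(t[y]) for y in py)
    e_x = lambda t: sum(px[x] * math.log2(t[x]) for x in px)

    prop2 = -e_y(t0)             # compare with (36)
    thm1 = -e_y(t0) - e_x(t1)    # compare with (37)
    thm4 = -e_x(t1) - e_y(t2)    # compare with (38)
    exact = h(p / 2) - p
    return prop2, thm1, thm4, exact

if __name__ == "__main__":
    for p in (0.1, 0.5, 0.9):
        print(p, ["%.4f" % v for v in z_channel_bounds(p)])
```

For each tested value of $p$ the four numbers are ordered as described in the text: the Proposition 2 bound is the loosest, the Theorem 4 bound is the tightest, and all three stay below the exact mutual information.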

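In order to evaluate the upper bound from Theorem 2 we use the natural actions $a_1(x) = x$ and $a_2(x) = 0$ with $p(a_1) = p(a_2) = \frac{1}{2}$. For this choice $a_1 \sim (x, y)$ if and only if $x = y$ and $a_2 \sim (x, y)$ if and only if $y = 0$, and therefore $\mathbb{E}_{XY} \mathbb{1}(a_1 \sim (X,Y)) = \Pr(X = Y) = 1 - \frac{p}{2}$ and $\mathbb{E}_{XY} \mathbb{1}(a_2 \sim (X,Y)) = \Pr(Y = 0) = 1 - \frac{p}{2}$. We have

$$\mathbb{E}_A \log \mathbb{E}_{XY} \mathbb{1}(A \sim (X,Y)) = \log\left(1 - \frac{p}{2}\right), \tag{39}$$

and

$$\begin{aligned}
\mathbb{E}_{X,Y} \log \mathbb{E}_A \frac{\mathbb{1}(A \sim (X,Y))}{\mathbb{E}_{X,Y} \mathbb{1}(A \sim (X,Y))} &= \mathbb{E}_{X,Y} \log \mathbb{E}_A \mathbb{1}(A \sim (X,Y)) - \log\left(1 - \frac{p}{2}\right) \\
&= \Pr(X = 0, Y = 0) \log(1) + \Pr(X = 1, Y = 0) \log\left(\frac{1}{2}\right) + \Pr(X = 1, Y = 1) \log\left(\frac{1}{2}\right) - \log\left(1 - \frac{p}{2}\right) \\
&= -p - \log\left(1 - \frac{p}{2}\right).
\end{aligned}$$

Recalling that $H(Y) = h(\frac{p}{2})$ and applying Theorem 2 we get

$$\begin{aligned}
I_P(X;Y) &\le h\left(\frac{p}{2}\right) + \log\left(1 - \frac{p}{2}\right) - p - \log\left(1 - \frac{p}{2}\right) \\
&= h\left(\frac{p}{2}\right) - p,
\end{aligned} \tag{40}$$

which is tight for any $p$.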
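The two tightness claims for the upper bound can also be checked numerically. The sketch below is a minimal illustration: `theorem2_upper_bound` is a hypothetical helper, and the three-term form it evaluates (entropy of $Y$, plus the term in (39), plus the non-positive correction term) is inferred from the worked examples in this section rather than copied from the statement of Theorem 2. The string `'e'` is used as a stand-in for the erasure symbol.

```python
import math

def theorem2_upper_bound(joint, actions):
    """joint: dict {(x, y): prob}; actions: list of (prob, function x -> y).

    Assumes every pair (x, y) with positive probability is adjacent to at
    least one action, i.e. the action set is consistent with the channel.
    """
    py = {}
    for (x, y), pr in joint.items():
        py[y] = py.get(y, 0.0) + pr
    h_y = -sum(pr * math.log2(pr) for pr in py.values() if pr > 0)

    # q[k] = E_{XY} 1(a_k ~ (X, Y)), where a ~ (x, y) iff y == a(x)
    q = [sum(pr for (x, y), pr in joint.items() if a(x) == y) for _, a in actions]
    first = sum(pa * math.log2(qa) for (pa, _), qa in zip(actions, q))

    second = 0.0
    for (x, y), pr in joint.items():
        if pr > 0:
            inner = sum(pa / qa for (pa, a), qa in zip(actions, q) if a(x) == y)
            second += pr * math.log2(inner)
    return h_y + first + second

if __name__ == "__main__":
    p = 0.3
    z_joint = {(0, 0): 1 - p, (1, 0): p / 2, (1, 1): p / 2}
    print(theorem2_upper_bound(z_joint, [(0.5, lambda x: x), (0.5, lambda x: 0)]))
```

With the natural actions of the Z channel this reproduces $h(p/2) - p$, and with the two natural actions of the erasure channel it reproduces $(1-\epsilon)h(p)$.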
Fig. 1. $I_P(X;Y)$ for the Z channel together with the three lower bounds from (36), (37) and (38) as a function of $p$.


C. Binary Deletion Channel 

The binary i.i.d. deletion channel operates by independently deleting input bits with probability $d$. In this subsection, we apply Theorem 1 and Theorem 2 to obtain lower and upper bounds on the mutual information for an i.i.d. uniform input process. Both bounds outperform the best known bounds in some regimes of deletion probabilities. In general, tighter lower bounds can be obtained by applying Theorem 4 with higher values of $k$. However, as will be demonstrated below, even the task of computing the bound from Theorem 1 (corresponding to Theorem 4 with $k = 0$) is quite challenging.

Lower Bound for an i.i.d Uniform Input 

We apply Theorem 1 to obtain a lower bound for $I(X;Y)$ under a uniform i.i.d. input distribution $X \sim \mathrm{Unif}(\{0,1\}^n)$,
which outperforms the best known bounds for i.i.d inputs IITtI . ITSl . Since the deletion channel is information stable, any 
rate smaller than the associated $\lim_{n \to \infty} I(X;Y)/n$ is achievable with uniform i.i.d. codebooks. Note that for a uniform i.i.d. input, the output $Y$ is also uniform i.i.d. given its length $\Theta n$, where the latter is binomial with parameters $(n, 1 - d)$.

For the i.i.d. deletion channel $\mathbb{1}(x \sim y)$ indicates whether or not $y$ is a subsequence of $x$. For $0 \le \theta \le 1$, define the operation $\langle \theta \rangle = \max(\theta, 1/2)$. According to [2, Lemma 3.1], for any $y$ of length $\theta n$ we have

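$$\sum_{x \in \{0,1\}^n} \mathbb{1}(x \sim y) = \sum_{j = \theta n}^{n} \binom{n}{j} \doteq 2^{n h(\langle \theta \rangle)}, \tag{41}$$

where $h(\cdot)$ is the binary entropy function, and $\doteq$ denotes exponential equality in the usual sense. This implies that for any $y$ of length $\theta n$ we have $\mathbb{E}_X \mathbb{1}(X \sim y) \doteq 2^{-n(1 - h(\langle \theta \rangle))}$. The function $h(\langle \theta \rangle)$ is concave in $\theta$ and therefore

$$\begin{aligned}
-\lim_{n \to \infty} \frac{1}{n} \mathbb{E}_Y \log \mathbb{E}_X \mathbb{1}(X \sim Y) &= -\mathbb{E}_\Theta \left( h(\langle \Theta \rangle) - 1 \right) \\
&\ge 1 - h(\langle \mathbb{E}\Theta \rangle),
\end{aligned} \tag{42}$$

where $\Theta$ is the normalized (random) length of $Y$.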
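The counting fact behind (41) is easy to verify by brute force for small $n$. The sketch below is only an illustration with hypothetical helper names: it counts the binary strings of length $n$ that contain a given $y$ as a subsequence and compares the result with $\sum_{j=|y|}^{n} \binom{n}{j}$, a standard identity for the binary alphabet that does not depend on the particular $y$.

```python
from itertools import product
from math import comb

def is_subsequence(y, x):
    """True if y can be obtained from x by deleting symbols."""
    it = iter(x)
    return all(bit in it for bit in y)  # each `in` consumes x up to the match

def count_supersequences(y, n):
    """Number of x in {0,1}^n that have y as a subsequence (exhaustive)."""
    return sum(is_subsequence(y, x) for x in product((0, 1), repeat=n))

if __name__ == "__main__":
    n = 8
    for y in [(0, 1, 1), (0, 0, 0), (1, 0, 1)]:
        print(y, count_supersequences(y, n), sum(comb(n, j) for j in range(len(y), n + 1)))
```

Dividing the count by $2^n$ gives $\mathbb{E}_X \mathbb{1}(X \sim y)$ for a uniform input, which is the quantity whose exponent appears in (42).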

The right hand side of (42) is a well known lower bound for the deletion channel capacity, obtained with a uniform i.i.d. input [14]. We now evaluate the second term in the bound of Theorem 1 in order to improve upon this bound. To this end, we first parse each $x \in \{0,1\}^n$ into phrases that contain exactly two bit flips and end immediately after the second flip. For example, the string 0001111011001110001 is parsed into the three phrases 00011110, 11001, 110001. We identify each phrase with three parameters: $b \in \{0,1\}$ is the first bit in the phrase, $k_1 \ge 2$ is the index of the first flip in the phrase, and $k_2 \ge 1$ is such that $k_1 + k_2$ is the total number of bits in the phrase. In our example, the three phrases correspond to $\{b = 0, k_1 = 4, k_2 = 4\}$, $\{b = 1, k_1 = 3, k_2 = 2\}$ and $\{b = 1, k_1 = 3, k_2 = 3\}$, respectively. For any pair of integers $2 \le k_1 \le n$, $1 \le k_2 \le n$ let $N_{k_1,k_2}(x)$ be the number of $\{k_1, k_2\}$-phrases in the parsing of $x$. For $\epsilon > 0$ we define the typical set

$$S_\epsilon \triangleq \left\{ x \in \{0,1\}^n \,:\, \left| \frac{N_{k_1,k_2}(x)}{n/5} - 2^{-(k_1 + k_2 - 1)} \right| < \epsilon \quad \forall\, 2 \le k_1 \le n,\ 1 \le k_2 \le n \right\}.$$


It holds that for any $\epsilon > 0$ and $n$ large enough $\Pr(X \in S_\epsilon)$ is indeed arbitrarily close to 1. To see this, define the three mutually independent i.i.d. processes

$$\begin{aligned}
B_i &\sim \mathrm{Bern}(1/2), \quad \text{i.i.d.} \\
K_{1i} &\sim 1 + \mathrm{Geometric}(1/2), \quad \text{i.i.d.} \\
K_{2i} &\sim \mathrm{Geometric}(1/2), \quad \text{i.i.d.}
\end{aligned}$$

and note that an i.i.d. Bern(1/2) random process is equivalent to the process obtained by stacking the random phrases $\{B_i, K_{1i}, K_{2i}\}$ one after the other. Moreover, the probability of such a random phrase being of type $\{k_1, k_2\}$ is $2^{-(k_1 + k_2 - 1)}$ and the expected length is $\mathbb{E}(K_{1i} + K_{2i}) = 5$. In our setting, $X$ is an $n$-dimensional i.i.d. Bern(1/2) random vector. Thus, $X$ can be generated by stacking exactly $n/5$ random phrases $\{B_i, K_{1i}, K_{2i}\}$ one after the other and either removing the last bits if the length of the obtained vector is greater than $n$, or appending i.i.d. Bern(1/2) bits to the vector if its length is smaller than $n$. Since the expected length of a phrase is 5 bits, for any $\delta > 0$ the number of removed/appended bits is w.h.p. smaller than $\delta n$. Therefore, the contribution of these bits to the distribution of the phrase lengths in the parsing of $X$ is negligible, and we get that $\Pr(X \in S_\epsilon) \to 1$ with $n$, by the law of large numbers.
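The parsing rule can be made concrete with a short sketch. The function `parse_phrases` below is a hypothetical helper written for illustration; it splits a bit string into phrases that end immediately after their second flip and reports the parameters $(b, k_1, k_2)$ of each phrase, together with any unfinished tail.

```python
def parse_phrases(bits):
    """Split a 0/1 string into phrases that end right after their second bit flip.

    Returns (phrases, leftover); each phrase is a tuple (text, b, k1, k2) with
    b the first bit, k1 the 1-based index of the first flip inside the phrase,
    and k1 + k2 the phrase length. `leftover` is the incomplete tail, if any.
    """
    phrases = []
    start, flips, first_flip = 0, 0, None
    for i in range(1, len(bits)):
        # Do not compare the first bit of a new phrase with the previous phrase.
        if i > start and bits[i] != bits[i - 1]:
            flips += 1
            if flips == 1:
                first_flip = i - start + 1
            else:
                text = bits[start:i + 1]
                phrases.append((text, int(text[0]), first_flip, len(text) - first_flip))
                start, flips, first_flip = i + 1, 0, None
    return phrases, bits[start:]

if __name__ == "__main__":
    for phrase in parse_phrases("0001111011001110001")[0]:
        print(phrase)
```

Applied to the example string from the text, the sketch returns the three phrases 00011110, 11001 and 110001 with parameters $(0, 4, 4)$, $(1, 3, 2)$ and $(1, 3, 3)$. Counting how often each pair $(k_1, k_2)$ occurs in the parsing of a long random string gives the empirical frequencies that the typical set compares with $2^{-(k_1+k_2-1)}$.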


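For $n$ large enough we can write

$$\begin{aligned}
-\mathbb{E}_X \log \mathbb{E}_Y \frac{\mathbb{1}(X \sim Y)}{\mathbb{E}_X \mathbb{1}(X \sim Y)} &= -\Pr(X \in S_\epsilon)\, \mathbb{E}_{X | S_\epsilon} \log \mathbb{E}_Y \frac{\mathbb{1}(X \sim Y)}{\mathbb{E}_X \mathbb{1}(X \sim Y)} - \Pr(X \notin S_\epsilon)\, \mathbb{E}_{X | \bar{S}_\epsilon} \log \mathbb{E}_Y \frac{\mathbb{1}(X \sim Y)}{\mathbb{E}_X \mathbb{1}(X \sim Y)} \\
&\ge -\Pr(X \in S_\epsilon) \log \mathbb{E}_Y \frac{\mathbb{E}_{X | S_\epsilon} \mathbb{1}(X \sim Y)}{\mathbb{E}_X \mathbb{1}(X \sim Y)} - \Pr(X \notin S_\epsilon) \log \mathbb{E}_Y \frac{\mathbb{E}_{X | \bar{S}_\epsilon} \mathbb{1}(X \sim Y)}{\mathbb{E}_X \mathbb{1}(X \sim Y)} \\
&\ge -(1 - \epsilon) \log \mathbb{E}_Y \frac{\mathbb{E}_{X | S_\epsilon} \mathbb{1}(X \sim Y)}{\mathbb{E}_X \mathbb{1}(X \sim Y)} - \epsilon n,
\end{aligned} \tag{43}$$

where the first inequality follows from Jensen's inequality and in the second we have used the fact that $\mathbb{E}_X \mathbb{1}(X \sim y) \ge 2^{-n}$ for any $y$, and therefore $\mathbb{1}(X \sim Y) / \mathbb{E}_X \mathbb{1}(X \sim y) \le 2^{n}$ for any $y$, along with $\Pr(X \in S_\epsilon) > 1 - \epsilon$. Recalling that $\Theta$ is the normalized (random) length of $Y$, we take the expectation $\mathbb{E}_Y$ as $\mathbb{E}_\Theta \mathbb{E}_{Y | \Theta}$ and use (41) to obtain

$$\begin{aligned}
\mathbb{E}_Y \frac{\mathbb{E}_{X | S_\epsilon} \mathbb{1}(X \sim Y)}{\mathbb{E}_X \mathbb{1}(X \sim Y)} &\doteq \mathbb{E}_\Theta\, 2^{n(1 - h(\langle \Theta \rangle))}\, \mathbb{E}_{Y | \Theta} \mathbb{E}_{X | S_\epsilon} \mathbb{1}(X \sim Y) \\
&= \mathbb{E}_\Theta\, 2^{n(1 - h(\langle \Theta \rangle))} \Pr(X \sim Y \,|\, \Theta, X \in S_\epsilon).
\end{aligned} \tag{44}$$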


Now, consider a greedy algorithm for determining whether $y$ is a subsequence of $x$, defined as follows [2, Section 3.1]: Scanning from left to right, take the first bit in $y$ and match it with its first appearance in $x$. Then take the second bit in $y$ and match it with its subsequent first appearance in $x$. Continue until either $x$ or $y$ is exhausted, where the latter case is termed success. It is easy to see that the greedy algorithm succeeds if and only if $x \sim y$. For statistically independent random vectors $X$ and $Y$, we enumerate the phrases in the parsing of $X$ by $i = 1, \ldots, M(X)$, where $M(X)$ is the (random) number of phrases in $X$. The vector $Y$ consists of $\Theta n$ i.i.d. uniform bits. To simplify computations, we construct a vector $Y'$ of length $n$ by taking $Y$ and possibly padding it with i.i.d. bits. We define the random variables $Z_i$ as the number of bits in $Y'$ that are matched to bits in the $i$th phrase of $X$ by the greedy algorithm. Under this construction, the event $\{\sum_{i=1}^{M(X)} Z_i \ge \Theta n\}$ coincides with the event $\{X \sim Y\}$, since the additional random suffix does not affect the event where the first $\Theta n$ bits in $Y'$ are matched. Under this assumption the $Z_i$'s are clearly mutually independent, given that the phrase types $\{k_{1i}, k_{2i}\}$ of $X$ are known (but assuming that their first bit identifiers $\{b_i\}$ remain random). Of course, the distribution of $Z_i$ depends on the parameters $k_{1i}, k_{2i}$ that correspond to the $i$th phrase in $X$. In the appendix, we show that given $K_{1i}$ and $K_{2i}$, the (base two) moment generating function of $Z_i$ is

^E{2*^^\Ku = ki,K2^ = k2) 

1 _ 9fci(i —1) / 1 _ — 

_ 2ki{t—i) _|_ _I ——— _I- 

1 -2‘-i V 1 - 
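To make the role of $Z_i$ concrete, the following sketch computes its base-two moment generating function for a single phrase by exhaustive enumeration and compares it with a decomposition over the events used in the Appendix. Everything here is an illustrative assumption: the helper names are hypothetical, the phrase is modelled as $k_1 - 1$ copies of $b$, then $k_2$ copies of $1 - b$, then a final $b$ (which matches the parsing rule in the text), and the summation ranges in `mgf_decomposed` are our own reading, checked only against the brute-force enumeration.

```python
from itertools import product

def greedy_matches(phrase, y):
    """Number of leading bits of y that the greedy matcher places inside `phrase`."""
    pos, matched = 0, 0
    for bit in y:
        while pos < len(phrase) and phrase[pos] != bit:
            pos += 1              # skip phrase bits that cannot host this y bit
        if pos == len(phrase):
            break                 # phrase exhausted; the bit moves on to the next phrase
        pos += 1
        matched += 1
    return matched

def mgf_bruteforce(k1, k2, t, b=0):
    """E[2^(t Z)] for one (k1, k2)-phrase and i.i.d. uniform y bits, by enumeration."""
    phrase = [b] * (k1 - 1) + [1 - b] * k2 + [b]
    n = len(phrase)               # k1 + k2 bits of y always suffice to determine Z
    total = sum(2.0 ** (t * greedy_matches(phrase, y)) for y in product((0, 1), repeat=n))
    return total / 2 ** n

def mgf_decomposed(k1, k2, t):
    """Same quantity, split over W (first k1 bits equal b) and its complement."""
    a = 2.0 ** (t - 1)
    t1_part = sum(a ** m for m in range(1, k1 + 1))
    t2_part = sum(a ** j for j in range(1, k2 + 1)) + 2.0 ** (-k2) * 2.0 ** (t * (k2 - 1))
    return a ** k1 + t1_part * t2_part

if __name__ == "__main__":
    for k1, k2 in [(2, 1), (3, 2), (4, 4)]:
        print(k1, k2, mgf_bruteforce(k1, k2, 0.7), mgf_decomposed(k1, k2, 0.7))
```

The two computations agree for the tested phrase types, which supports the independence structure (first the run of $b$'s, then the run of $1 - b$'s) that the closed-form expression relies on.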

Noting that by definition, for $X \in S_\epsilon$ the number of phrases $M(X)$ and their composition $N_{k_1,k_2}(X)$ are essentially deterministic,

we can use Chernoff’s bound ifT^ to obtain 

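$$\Pr(X \sim Y \,|\, \Theta = \theta, X \in S_\epsilon) = \Pr\left( \sum_{i=1}^{M(X)} Z_i \ge \theta n \,\Big|\, \Theta = \theta, X \in S_\epsilon \right) \,\dot{\le}\, 2^{-n \Lambda^*(\theta)},$$

where

$$\Lambda^*(\theta) = \max_{t \ge 0} \left( t\theta - \frac{1}{5} \sum_{k_1 = 2}^{\infty} \sum_{k_2 = 1}^{\infty} 2^{-(k_1 + k_2 - 1)} \log \mathbb{E}\left( 2^{t Z_i} \,|\, K_{1i} = k_1, K_{2i} = k_2 \right) \right).$$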

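Substituting into (43) and (44), and applying standard large deviations arguments, we obtain

$$-\lim_{n \to \infty} \frac{1}{n} \mathbb{E}_X \log \mathbb{E}_Y \frac{\mathbb{1}(X \sim Y)}{\mathbb{E}_X \mathbb{1}(X \sim Y)} \ge g(d),$$

where

$$g(d) \triangleq \min_{\theta}\; D_2(\theta \,\|\, 1 - d) - \left(1 - h(\langle \theta \rangle)\right) + \Lambda^*(\theta),$$

where $D_2(p \| q)$ is the binary relative entropy function. It follows that for a uniform i.i.d. input distribution,

$$\lim_{n \to \infty} \frac{1}{n} I(X;Y) \ge 1 - h(\min(d, 1/2)) + g(d). \tag{45}$$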

Numerical evaluation of the term $g(d)$ reveals that it is greater than zero for all $d < 1/2$. Thus, (45) improves over Gallager's well known bound $1 - h(d)$ [14]. Recently, Rahmati and Duman [15] used a different technique to lower bound the mutual information for uniform i.i.d. inputs. For small values of $d$ their bound is better than (45), but for larger values of $d$ the right hand side of (45) turns out to be greater than their bound. For example, for $d = 0.2$ our bound improves on $1 - h(0.2)$ by $\approx 0.0117$ bits (roughly 5%), whereas the improvement of [15] is negligible. See Figure 2.

Upper Bound for i.i.d Inputs 
By Theorem 2 we have in particular that

$$I(X;Y) \le H(Y) + \mathbb{E}_A \log \mathbb{E}_{X,Y} \mathbb{1}(A \sim (X,Y)). \tag{46}$$

Fig. 2. The multiplicative improvement factor w.r.t. $1 - h(d)$ attained by our lower bound on the mutual information for an i.i.d. uniform input. For comparison, we also plot the improvement the lower bound from [15] attains w.r.t. $1 - h(d)$.

Let $X$ be an i.i.d. Bern($q$) input vector of length $n$ for some $q \le \frac{1}{2}$. It can be shown that the length of $Y$ is $n\Theta \sim \mathrm{Binomial}(n, 1 - d)$, and given its length, $Y$ is i.i.d. Bern($q$). Thus,

$$\begin{aligned}
\frac{1}{n} H(Y) &= \frac{1}{n} \left( H(Y | \Theta) + H(\Theta) \right) \\
&= (1 - d) h(q) + o(1).
\end{aligned} \tag{47}$$

The challenge is thus to evaluate the second term in (46), which is given by

$$\begin{aligned}
\mathbb{E}_A \log \mathbb{E}_{X,Y} \mathbb{1}(A \sim (X,Y)) &= \mathbb{E}_A \log \mathbb{E}_{A'} \mathbb{E}_X \mathbb{1}(A \sim (X, A'(X))) \\
&= \mathbb{E}_A \log \mathbb{E}_{A'} \mathbb{E}_X \mathbb{1}(A(X) = A'(X)) \\
&= \mathbb{E}_A \log \mathbb{E}_{A'} \Pr(A(X) = A'(X)),
\end{aligned} \tag{48}$$

where $A' \sim P_A$ such that $(X, A'(X)) \sim P_{XY}$. Note that here $X, A, A'$ are mutually independent.

Let us specifically choose $A$ as in Example 3, namely we identify $A$ with a Bern($1 - d$) i.i.d. vector of length $n$, and $A(X)$ corresponds to sampling $X$ in the locations chosen by that vector. Asymptotically, we can assume without loss of generality that both $A$ and $A'$ are drawn uniformly over vectors of weight $n(1 - d)$. This follows since for any given weight of $A$, the inner expectation w.r.t. $A'$ only increases by replacing the i.i.d. distribution with a uniform distribution over all vectors with the same weight. Furthermore, the outer expectation w.r.t. $A$ is asymptotically dominated by the uniform distribution over vectors of weight $n(1 - d)$.

Let us define $S$ to be the action that chooses only the coordinates selected by $A'$ but not by $A$. Let $\bar{S}$ be the complementary action (that chooses only the remaining coordinates). Given any $A'$ and $A$, for any assignment of the values of $X$ in the coordinates chosen by $\bar{S}$, there is either a unique assignment $\phi(\bar{S}(X))$ of the values of $X$ in the coordinates chosen by $S$ that satisfies $A'(X) = A(X)$, or there is none. In the latter case, we set $\phi(\bar{S}(X))$ to an arbitrary value. Thus we can write

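$$\begin{aligned}
\Pr(A(X) = A'(X)) &= \Pr\left( X \in \{ x \in \{0,1\}^n : A'(x) = A(x) \} \right) \\
&\le \Pr\left( X \in \{ x \in \{0,1\}^n : S(x) = \phi(\bar{S}(x)) \} \right) \\
&= \Pr\left( S(X) = \phi(\bar{S}(X)) \right) \\
&= \mathbb{E} \Pr\left( S(X) = \phi(\bar{S}(X)) \,|\, \bar{S}(X) \right) \\
&\le \mathbb{E} \max_{u \in \{0,1\}^{|S|}} \Pr\left( S(X) = u \,|\, \bar{S}(X) \right) \\
&= \mathbb{E} \max_{u \in \{0,1\}^{|S|}} \Pr\left( S(X) = u \right) \\
&= \max_{u \in \{0,1\}^{|S|}} \Pr\left( S(X) = u \right) \\
&= (1 - q)^{|S|}.
\end{aligned}$$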

Returning to (48) and using the above, we have

$$\mathbb{E}_A \log \mathbb{E}_{A'} \Pr(A(X) = A'(X)) \le \mathbb{E}_A \log \mathbb{E}_{A'} (1 - q)^{|S|},$$

where the only randomness is in $|S|$, which is a deterministic function of $A$ and $A'$. In particular, $|S|$ is the number of coordinates chosen by $A'$ and not by $A$. Since $A$ and $A'$ were assumed to be uniformly distributed over constant weight vectors of weight $(1 - d)n$, simple counting arguments show that for every action $a$

$$\Pr\left( |S| = p(1 - d)n \,|\, A = a \right) = \frac{\binom{(1-d)n}{p(1-d)n} \binom{dn}{p(1-d)n}}{\binom{n}{(1-d)n}}.$$

Thus, maximizing over feasible values of $p$,

$$\lim_{n \to \infty} \frac{1}{n} \mathbb{E}_A \log \mathbb{E}_{A'} \Pr(A(X) = A'(X)) \le \max_{0 \le p \le \frac{d}{1-d}}\; (1 - d) h(p) + d \cdot h\left( p\, \frac{1-d}{d} \right) - h(d) + (1 - d)\, p \log(1 - q).$$

Plugging the above in (46) and using (47), we obtain the bound

$$\lim_{n \to \infty} \frac{1}{n} I(X;Y) \le (1 - d) h(q) - h(d) + \max_{p}\, r(p),$$

where

$$r(p) \triangleq (1 - d) \left( h(p) + p \log(1 - q) \right) + d \cdot h\left( p\, \frac{1-d}{d} \right).$$

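We note that the maximization over $p$ can be solved directly by differentiation. Setting $r'(p) = 0$ gives the quadratic equation $q(1-d)p^2 + (1-q)p - (1-q)d = 0$, so the maximizing value is

$$p^* = \frac{-(1-q) + \sqrt{(1-q)^2 + 4q(1-q)d(1-d)}}{2q(1-d)},$$

and we therefore have

$$\lim_{n \to \infty} \frac{1}{n} I(X;Y) \le (1 - d) h(q) - h(d) + r(p^*). \tag{49}$$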

In the limit of $d \to 1$ it is easy to see that $p^* \to d$, and direct substitution into (49) reveals that for $q = 1/2$ the upper bound
is smaller than (1 — d)^ for large d. In ifTTl it was shown that for an i.i.d. Bern(( 7 ) input process 

$$\lim_{n \to \infty} \frac{1}{n} I(X;Y) \le (1 - d) \left( h(q) - 2dq(1 - q) \right). \tag{50}$$

Our new upper bound is plotted in Figure 3 for $q = 1/2$ along with the upper bound (50) and the trivial upper bound $1 - d$. It is seen that for this choice of $q$ our new bound is better than (50) for all deletion probabilities.
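The comparison between the bounds can be reproduced numerically. The sketch below is a minimal illustration with hypothetical function names. The closed form used in `p_star` is the positive root of $q(1-d)p^2 + (1-q)p - (1-q)d = 0$, which is what we obtain by setting $r'(p) = 0$; it is our own derivation and should be treated as an assumption, which the usage example cross-checks against a grid search.

```python
import math

def h(u):
    """Binary entropy in bits, with h(u) = 0 outside the open interval (0, 1)."""
    return 0.0 if u <= 0.0 or u >= 1.0 else -u * math.log2(u) - (1 - u) * math.log2(1 - u)

def r(p, d, q):
    """The function r(p) that is maximised in the upper bound."""
    return (1 - d) * (h(p) + p * math.log2(1 - q)) + d * h(p * (1 - d) / d)

def p_star(d, q):
    """Stationary point of r: positive root of q(1-d)p^2 + (1-q)p - (1-q)d = 0."""
    disc = (1 - q) ** 2 + 4 * q * (1 - q) * d * (1 - d)
    return (-(1 - q) + math.sqrt(disc)) / (2 * q * (1 - d))

def new_upper_bound(d, q=0.5):
    """Right-hand side of (49)."""
    return (1 - d) * h(q) - h(d) + r(p_star(d, q), d, q)

def earlier_upper_bound(d, q=0.5):
    """Right-hand side of (50)."""
    return (1 - d) * (h(q) - 2 * d * q * (1 - q))

if __name__ == "__main__":
    d, q = 0.3, 0.5
    p_max = min(1.0, d / (1 - d))
    grid_best = max((p_max * i / 2000 for i in range(2001)), key=lambda p: r(p, d, q))
    print("p* closed form:", p_star(d, q), " grid search:", grid_best)
    for d in (0.1, 0.3, 0.5, 0.7, 0.9):
        print(d, round(new_upper_bound(d), 4), round(earlier_upper_bound(d), 4), round(1 - d, 4))
```

Since $r$ is a sum of concave functions of $p$, its stationary point is the maximizer, so the grid search should land next to the closed-form value.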
Fig. 3. Our new upper bound (49) plotted for $q = 1/2$ along with the upper bound (50) and the trivial upper bound $1 - d$.

We remark that although here we have only applied the bounds from Theorems 1 and 2 for handling deletion channels, we expect a similar approach to yield improved results also for insertion channels.

D. Most Informative Boolean Function Conjecture 

Let $X$ be an $n$-dimensional binary vector uniformly distributed over $\{0,1\}^n$, and $Y$ be the output of passing each component of $X$ through a binary symmetric channel with crossover probability $\alpha \le 1/2$. Let $f : \{0,1\}^n \to \{0,1\}$ be a boolean function. Following a recent conjecture by Courtade and Kumar [6], there has been much interest in developing useful upper bounds on $I(f(X); Y)$, where the ultimate goal is to prove that this quantity is maximized by the dictatorship function $f(X) = X_i$ for some $i \in [n]$. In this subsection, we apply Theorem 2 to derive the following novel upper bound.

Theorem 5: Let $X, Z, W \in \{0,1\}^n$ be three statistically independent random vectors, with the entries of $X$ i.i.d. Bern(1/2), and the entries of $Z$ and $W$ i.i.d. Bern($\alpha$). Let $Y = X \oplus Z$. For any boolean function $f : \{0,1\}^n \to \{0,1\}$,

$$I(Y; f(X)) \le H(f(X)) + \mathbb{E}_W \log \Pr\left( f(X \oplus W \oplus Z) = f(X) \right). \tag{51}$$

Proof: Identify the action that maps $Y$ to $f(X)$ with drawing an i.i.d. vector $W$ with Bern($\alpha$) entries and setting $A(Y) = f(Y \oplus W)$. The bound of Theorem 2 reads (discarding the last term, which is non-positive)

$$\begin{aligned}
I(Y; f(X)) &\le H(f(X)) + \mathbb{E}_A \log \mathbb{E}_{Y, f(X)} \mathbb{1}(A(Y) = f(X)) \\
&= H(f(X)) + \mathbb{E}_W \log \mathbb{E}_{Y, f(X)} \mathbb{1}(f(Y \oplus W) = f(X)) \\
&= H(f(X)) + \mathbb{E}_W \log \Pr\left( f(X \oplus W \oplus Z) = f(X) \right),
\end{aligned} \tag{52}$$

as desired.
For a fixed $w \in \{0,1\}^n$, let us now express $\Pr(f(X \oplus w \oplus Z) = f(X))$. To this end, we use the standard isomorphism $0 \to 1$, $1 \to -1$, $\oplus \to \cdot$. Under this isomorphism we need to calculate $\Pr(f(X \cdot w \cdot Z) = f(X))$, where the products between

vectors are taken componentwise. Recall lITSl that / : {—1,1}" —{—1,1} admits the Fourier-Walsh expansion 


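$$f(x) = \sum_{S \subseteq [n]} \hat{f}(S) \prod_{i \in S} x_i, \tag{53}$$

where

$$\hat{f}(S) \triangleq \mathbb{E}\left( f(X) \prod_{i \in S} X_i \right), \tag{54}$$

and the expectation is taken w.r.t. the i.i.d. uniform distribution on $\{-1, 1\}$. Let $f_w(X) \triangleq f(X \cdot w)$, and note that it immediately follows from (53) that $\hat{f}_w(S) = \hat{f}(S) \prod_{i \in S} w_i$. We have

$$\begin{aligned}
\Pr\left( f(X \cdot w \cdot Z) = f(X) \right) &= \Pr\left( f_w(X \cdot Z) = f(X) \right) \\
&= \frac{1 + \mathbb{E}\left( f(X) f_w(X \cdot Z) \right)}{2} \\
&= \frac{1 + \mathbb{E}\left( \sum_{S \subseteq [n]} \hat{f}(S) \prod_{i \in S} X_i \sum_{T \subseteq [n]} \hat{f}_w(T) \prod_{j \in T} X_j Z_j \right)}{2} \\
&= \frac{1 + \sum_{S \subseteq [n]} \hat{f}^2(S) (1 - 2\alpha)^{|S|} \prod_{i \in S} w_i}{2},
\end{aligned} \tag{55}$$

where in (55) we have used the facts that $\mathbb{E}(X_i X_j) = \mathbb{1}(i = j)$ and $\mathbb{E}(Z_i) = 1 - 2\alpha$ for any $i, j \in [n]$. Now, substituting (55) into (52) gives the following corollary.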

Corollary 1: For any boolean $f : \{-1, 1\}^n \to \{-1, 1\}$,

$$I(Y; f(X)) \le H(f(X)) - 1 + \mathbb{E}_W \log\left( 1 + \sum_{S \subseteq [n]} \hat{f}^2(S) (1 - 2\alpha)^{|S|} \prod_{i \in S} W_i \right), \tag{56}$$

where $W_i$ are i.i.d. with $\Pr(W_i = -1) = 1 - \Pr(W_i = 1) = \alpha$.

We note that the upper bounds from Theorem 5 and Corollary 1 are tight for the function $f(X) = X_i$. Thus, showing that the dictatorship function maximizes (52) or (56) will settle the most informative boolean function conjecture [6]. Unfortunately, our attempts to prove the former were not successful.
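For small $n$ the bound (51) can be compared with the exact mutual information by exhaustive enumeration. The following sketch is only an illustration with hypothetical helper names; it works over $\{0,1\}^n$ with XOR, as in the statement of Theorem 5, and evaluates both sides for the dictatorship and majority functions on three bits.

```python
import math
from itertools import product

def h(u):
    """Binary entropy in bits."""
    return 0.0 if u in (0.0, 1.0) else -u * math.log2(u) - (1 - u) * math.log2(1 - u)

def bsc_weight(z, alpha):
    """Probability of the noise pattern z under i.i.d. Bern(alpha) entries."""
    k = sum(z)
    return alpha ** k * (1 - alpha) ** (len(z) - k)

def xor(*vectors):
    return tuple(sum(bits) % 2 for bits in zip(*vectors))

def mutual_information(f, n, alpha):
    """I(Y; f(X)) in bits for uniform X and Y = X xor Z, by enumeration."""
    cube = list(product((0, 1), repeat=n))
    joint = {}
    for x in cube:
        for z in cube:
            key = (xor(x, z), f(x))
            joint[key] = joint.get(key, 0.0) + bsc_weight(z, alpha) / len(cube)
    pf = {v: sum(pr for (_, fv), pr in joint.items() if fv == v) for v in (0, 1)}
    # Y is uniform on the cube because X is uniform, so P(y) = 1 / len(cube).
    return sum(pr * math.log2(pr * len(cube) / pf[fv])
               for (_, fv), pr in joint.items() if pr > 0)

def theorem5_bound(f, n, alpha):
    """H(f(X)) + E_W log Pr(f(X xor W xor Z) = f(X)), the right-hand side of (51)."""
    cube = list(product((0, 1), repeat=n))
    pf1 = sum(f(x) for x in cube) / len(cube)
    total = 0.0
    for w in cube:
        agree = sum(bsc_weight(z, alpha) for x in cube for z in cube
                    if f(xor(x, w, z)) == f(x)) / len(cube)
        total += bsc_weight(w, alpha) * math.log2(agree)
    return h(pf1) + total

if __name__ == "__main__":
    alpha, n = 0.1, 3
    for name, f in [("dictator", lambda x: x[0]), ("majority", lambda x: int(sum(x) >= 2))]:
        print(name, mutual_information(f, n, alpha), theorem5_bound(f, n, alpha))
```

For the dictatorship function both quantities equal $1 - h(\alpha)$, in line with the tightness remark above, while for majority the bound is strictly above the exact value in this example.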


Appendix 

Given that $K_{1i} = k_1$ and $K_{2i} = k_2$, we know that the $i$th phrase in the parsing of $X$ is of the form

$$\underbrace{B \cdots B\, \bar{B}}_{k_1}\; \underbrace{\bar{B} \cdots \bar{B}\, B}_{k_2}, \tag{57}$$

where $B \sim \mathrm{Bern}(\frac{1}{2})$ and $\bar{B} = 1 - B$. The r.v. $Z_i$ counts the number of bits in $Y'$ that were matched by the greedy algorithm to bits in the $i$th phrase of $X$. Thus, conditioned on the event $K_{1i} = k_1, K_{2i} = k_2$, the r.v. $Z_i$ counts the number of bits from an i.i.d. uniform sequence (corresponding to the relevant bits in $Y'$) that are matched by the greedy algorithm to bits in the phrase (57).

Let $W$ be the event that the first $k_1$ bits of the i.i.d. sequence are equal to $B$. Clearly, $\Pr(W) = 2^{-k_1}$, and if $W$ occurs then $Z_i = k_1$. Let $T_1$ be the location of the first occurrence of $\bar{B}$ in the i.i.d. sequence, and let $\tilde{T}_2$ be the location of the first occurrence of $B$ after $T_1$. Further, let $T_2 = \tilde{T}_2 - T_1$. For example, if the sequence of i.i.d. bits is

$$B\, B\, \bar{B}\, \bar{B}\, \bar{B}\, \bar{B}\, \bar{B}\, B \ldots,$$

then $T_1 = 3$ and $T_2 = 5$, and if the sequence of i.i.d. bits is

$$\bar{B}\, \bar{B}\, B \ldots,$$

then $T_1 = 1$ and $T_2 = 2$. We further define the r.v.

$$\hat{T}_2 = \begin{cases} T_2 & T_2 \le k_2 \\ k_2 - 1 & T_2 > k_2 \end{cases}$$

Note that given $\bar{W}$ (the event that $W$ did not occur), we have $Z_i = T_1 + \hat{T}_2$. We have

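$$\begin{aligned}
\mathbb{E}\left( 2^{t Z_i} \,|\, K_{1i} = k_1, K_{2i} = k_2 \right) &= \Pr(W)\, \mathbb{E}\left( 2^{t Z_i} \,|\, K_{1i} = k_1, K_{2i} = k_2, W \right) + \Pr(\bar{W})\, \mathbb{E}\left( 2^{t Z_i} \,|\, K_{1i} = k_1, K_{2i} = k_2, \bar{W} \right) \\
&= 2^{-k_1} 2^{t k_1} + \left(1 - 2^{-k_1}\right) \mathbb{E}\left( 2^{t (T_1 + \hat{T}_2)} \,|\, \bar{W} \right) \\
&= 2^{-k_1} 2^{t k_1} + \left(1 - 2^{-k_1}\right) \mathbb{E}\left( 2^{t T_1} \,|\, \bar{W} \right) \mathbb{E}\left( 2^{t \hat{T}_2} \right).
\end{aligned} \tag{58}$$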

The r.v.s $T_1$ and $T_2$ are statistically independent Geometric(1/2), and therefore


1 < m < fci — 1 


and 


Pr(Ti = m|VP) = {^-^ " 

0 otherwise 


2"* 1 < m < k2,m k2 — I 

Pi{T2 = m) = <( 2-'== + 2-""lL(fc2 >1) m = fc2 - 1 
0 otherwise 


This gives 


fel —1 


E 


1 

1 -T- 

1 

1 - 2-G 1 - 2*-i 


(2™) = E 

2* ^ 2fci(t-i)^ 


m—1 
■yt-1 


and for k2 > 1 


k2 


E 


m—1 

^ _ 2''2(*-i)^ _l_ 2''2p“i)” 


1 - 2‘-i 


( 58 ) 


( 59 ) 


( 60 ) 


Note that for $k_2 = 1$ we have $\mathbb{E}\left( 2^{t \hat{T}_2} \right) = \frac{1}{2} + \frac{1}{2} 2^{t}$, and (60) continues to hold. Substituting (59) and (60) into (58) yields the desired expression.